Clustering Techniques for Revealing Gene Expression Patterns

Crescenzio Gallo, Vito Capozzi
Copyright: © 2015 |Pages: 10
DOI: 10.4018/978-1-4666-5888-2.ch042

Chapter Preview



The possible applications of modeling and simulation in bioinformatics are very extensive, ranging from understanding basic metabolic pathways to exploring genetic variability. Molecular biologists need robust computational tools to build models that can learn to recognize DNA and amino acid sequences and assign protein structures to given sequences. Experiments carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that show similar expression patterns. In this chapter we describe the main clustering algorithms developed for analyzing gene expression data and compare their results with the classification obtained by applying unsupervised neural networks.

In the analysis of gene expression data, the search for correlated patterns is of particular interest and is typically carried out through cluster analysis. DNA microarray technologies (Lockhart et al., 1996) allow thousands of genes to be monitored quickly and efficiently. These technologies have introduced new ways of exploring an organism from a genome-wide perspective. In particular, it is now possible to study gene expression across a complete genome (such as that of Saccharomyces cerevisiae). Studies based on DNA microarrays (Perou et al., 1999) have also been carried out in the period leading up to the complete mapping of the human genome. The production of targeted drugs and drug identification are other areas that can benefit significantly from these techniques.

One problem inherent in the use of DNA microarray technology is the huge amount of data it produces, the analysis of which is a significant problem in itself. The approaches used to analyze gene expression data fall into two areas: clustering and classification. Clustering is a purely data-driven activity that uses only data from the study or experiment to group measurements together. Classification, in contrast, uses additional data, including heuristics, to assign measurements to groups. Among these approaches, the statistical methods commonly applied to microarray data are Hierarchical Clustering (Sneath & Sokal, 1973) and (Unsupervised) Neural Networks (Herrero et al., 2001). The identification of the optimal method for analyzing these data is still a topic of discussion.
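As an illustration of the data-driven grouping described above, the following is a minimal sketch of agglomerative hierarchical clustering with average linkage. It is not the chapter's own procedure: the choice of 1 minus the Pearson correlation as the distance, the function names, and the toy expression profiles are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither profile is constant

def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    # Assumed dissimilarity: 1 - Pearson correlation between two genes.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Hypothetical toy data: four genes measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
}
groups = sorted(average_linkage(profiles, 2))
print(groups)
```

On this toy input the two rising profiles end up in one group and the two falling profiles in the other. Real microarray data would normally call for a library implementation and a check for constant profiles, which this sketch does not handle.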

In this chapter we examine some methods for gene co-expression analysis, such as “correlation graphs” and supervised and unsupervised clustering methods. The next section briefly presents the background of clustering techniques. We then detail the clustering algorithm based on correlation graphs and examine the application of supervised and unsupervised techniques. The chapter ends with some final considerations and directions for further research.
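The preview does not show the correlation-graph algorithm itself, so the following is only a rough sketch of one common reading of the idea: genes are nodes, an edge joins two genes whose profiles have a Pearson correlation at or above a threshold, and connected components are taken as candidate clusters. The threshold value, the function names, and the toy data are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither profile is constant

def correlation_graph(profiles, threshold):
    """Adjacency sets: an edge links genes with correlation >= threshold."""
    graph = {gene: set() for gene in profiles}
    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_components(graph):
    """Depth-first search; each component is one candidate cluster."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical toy data: four genes measured under four conditions.
expression = {
    "g1": [0.1, 0.5, 0.9, 1.4],
    "g2": [0.2, 0.6, 1.1, 1.5],
    "g3": [1.3, 0.9, 0.4, 0.1],
    "g4": [0.5, 0.5, 0.6, 0.4],
}
components = connected_components(correlation_graph(expression, 0.9))
print(components)
```

With the assumed threshold of 0.9, only g1 and g2 are linked, so they form one component and the other two genes remain singletons. Lowering the threshold makes the graph denser and the components larger.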

Key Terms in this Chapter

Artificial Neural Network: A mathematical model that represents the interconnection between elements called artificial neurons, i.e. mathematical constructs that to some extent mimic the properties of living neurons. These models can be used to gain an understanding of biological neural networks and, even more, to solve engineering problems of artificial intelligence such as those arising in various technological fields (electronics, computer science, simulation, and other disciplines).

DNA Microarray: (Commonly known as gene chip, DNA chip, biochip, or high-density array) A collection of microscopic DNA probes attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array (matrix). Such arrays make it possible to examine simultaneously the presence of many genes within a DNA sample (which can often represent the entire genome or transcriptome of an organism). A typical use is to compare the gene expression profile of a patient with that of a healthy individual in order to identify which genes are involved in the disease.

Pattern: In biology, a pattern (sometimes “profile”) refers to different types of regularity. Examples include the regularity of DNA or protein sequences that allows recognition and specific binding between molecules; the regularity in the expression levels of genes that allows different experimental cell types, including tumor cell types, to be recognized; the regularity in the events that occur during processes such as the development of an organism; and even the regularities in the behavior of animals.

Algorithm: A procedure that solves a given problem in a finite number of steps. A problem that can be solved by an algorithm is said to be computable. The term “algorithm” is derived from the Latin transcription of the name of the Persian mathematician al-Khwarizmi, who is considered one of the first authors to have referred to this concept.

Gene: The fundamental hereditary unit of living organisms. Genes correspond to portions of the genetic code localized in specific positions within the sequence (DNA or, more rarely, RNA) and contain all the information necessary for the production of a protein. They are contained and organized within chromosomes, present in all cells of an organism.

Deoxyribonucleic Acid (DNA): A nucleic acid that contains the genetic information necessary for the biosynthesis of RNA and of the protein molecules essential for the development and proper functioning of most living organisms. The sequential order of the nucleotides A, T, C, G encodes the genetic information, which is translated through the genetic code into the corresponding amino acids.

Clustering: A set of multivariate data analysis techniques aimed at selecting and grouping homogeneous elements in a data set. Clustering techniques are based on measures of similarity between elements. In many approaches this similarity, or rather dissimilarity, is defined in terms of distance in a multidimensional space. Clustering algorithms group items on the basis of their mutual distance, so whether an element belongs to a set depends on how far the element is from the set itself.

Gene Expression: The measurement of the activity (expression) of thousands of genes at a time, to create a global picture of cellular function. The resulting profiles can, for example, distinguish cells that are proliferating, or show how cells react to a particular treatment. Many experiments of this type measure an entire genome simultaneously. DNA microarray technology measures the relative activity of previously identified target genes.

Cluster: Natural subgroup of a population, used for statistical sampling or analysis.

Bioinformatics: A scientific discipline devoted to solving biological problems at the molecular level with computational methods. It is an attempt to describe biological phenomena in numerical and statistical terms, using a set of analytical and numerical tools. In addition to information technology, bioinformatics draws on applied mathematics, statistics, chemistry, biochemistry, and concepts of artificial intelligence. Bioinformatics mainly deals with: (1) providing valid statistical models for the interpretation of data from experiments in molecular biology and biochemistry, in order to identify trends and numerical laws; (2) generating new models and mathematical tools for the analysis of DNA, RNA, and protein sequences, in order to create a body of knowledge about the frequency of relevant sequences, their evolution, and their possible function; (3) organizing the knowledge acquired at the global level in genome and proteome databases, in order to make such data accessible to all, and optimizing data search algorithms to improve accessibility.
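The distance-based view given in the Clustering entry above can be illustrated with a short sketch in which an element is assigned to the group whose centroid is nearest in Euclidean distance. Using the centroid as the representative of a collection is an assumption made for this example (other choices, such as the nearest or farthest member, are equally possible), and the cluster names and values are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def nearest_cluster(element, clusters):
    """Name of the cluster whose centroid is closest to the element."""
    return min(clusters, key=lambda name: dist(element, centroid(clusters[name])))

# Hypothetical clusters of expression profiles over three conditions.
clusters = {
    "rising": [[1.0, 2.0, 3.0], [1.2, 2.1, 3.3]],
    "falling": [[3.0, 2.0, 1.0], [3.1, 1.8, 0.9]],
}
assigned = nearest_cluster([0.9, 2.2, 2.8], clusters)
print(assigned)
```

Here the new profile lies much closer to the centroid of the "rising" group than to that of the "falling" group, so it is assigned to the former.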
