This chapter introduces the techniques that have been used to identify genetic regulatory modules by integrating data from various sources. Data relating to the functioning of individual genes can be drawn from many diverse experimental techniques. Each piece of data provides information on a specific aspect of the cell regulation process. The chapter argues that integration of these diverse types of data is essential in order to identify biologically relevant regulatory modules. A concise review of the different integration techniques is presented, together with a critical discussion of their pros and cons. A very large number of research papers have been published on this topic, and the authors hope that this chapter will give the reader a high-level view of the area, elucidating the research issues and underlining the importance of data integration in modern bioinformatics.
A network of transcription factors regulating transcription factors or other proteins is called a transcriptional regulatory network or gene regulatory network. The understanding and reconstruction of this regulation process at a global level is one of the major challenges for the nascent field of bioinformatics (Schölkopf et al., 2004).
Considerable work has been done by molecular biologists over the last few years in identifying the functions of specific genes. In an ideal world, these results would be used to build detailed models of regulation in which the precise action of each gene is understood. However, the large number of genes and the complexity of the regulation process have made this approach infeasible. Research into discovering causal models based on the actions of individual genes has encountered a major difficulty: a large number of parameters must be estimated from a paucity of experimental data. Fortunately, biological organisation opens up the possibility of modelling at a less detailed level. In nature, the complex functions of living cells are carried out through the concerted activities of many genes and gene products, which are organized into co-regulated sets, also known as regulatory modules (Segal et al., 2003). Understanding the organization of these sets of genes will provide insights into the cellular response mechanism under various conditions. Recently, a considerable volume of data on gene activity, measured using several diverse techniques, has become widely available. By fusing this data using an integrative approach, we can try to unravel the regulation process at a more global level. Although an integrated model could never be as precise as one built from a small number of genes in controlled conditions, such global modelling can provide insights into higher processes where many genes work together to achieve a task. Researchers have employed various techniques from statistics, machine learning and computer science to analyse and combine the different types of data in an attempt to identify and understand the function of regulatory modules.
There are two underlying problems resulting from the nature of the available data. Firstly, each of the different data types (microarray, DNA-binding, protein-protein interaction and sequence data) provides only a partial and noisy picture of the whole process, so they need to be integrated in order to obtain a more reliable picture of the underlying process. Secondly, the amount of data available from each of these techniques is severely limited. Learning good models requires large amounts of data, yet data is available for only a few experiments of each type. To alleviate this problem, many researchers have merged all available datasets before carrying out an analysis. There can therefore be some confusion regarding the term integrative, because it has been used to describe two very different approaches to data integration: one among datasets of the same type (for example microarrays) but from different experiments, and the other among different types of data (for example microarray and DNA-binding data).
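To make the distinction between the two senses of integration concrete, the following is a minimal sketch in Python. All gene names, transcription factor names, values and function names are hypothetical and chosen only for illustration; the methods reviewed in this chapter are considerably more elaborate.

```python
# Hypothetical toy data: expression profiles per gene from two separate
# microarray experiments (same data type, different experiments).
expr_exp1 = {"g1": [1.2, 0.8], "g2": [1.1, 0.9], "g3": [-0.5, 0.1]}
expr_exp2 = {"g1": [2.0], "g2": [1.9], "g4": [0.3]}

# Hypothetical DNA-binding calls (a different data type):
# transcription factor -> set of genes it was observed to bind.
binding = {"TF1": {"g1", "g2"}, "TF2": {"g3"}}


def merge_same_type(*datasets):
    """Integrate datasets of the same type by concatenating the profiles
    of the genes that are present in every dataset."""
    shared = set(datasets[0]).intersection(*datasets[1:])
    return {g: [v for d in datasets for v in d[g]] for g in sorted(shared)}


def combine_types(expression, binding_calls):
    """Integrate different data types: for each transcription factor, keep
    the bound genes that also have an expression profile."""
    return {
        tf: {g: expression[g] for g in sorted(targets) if g in expression}
        for tf, targets in binding_calls.items()
    }


merged = merge_same_type(expr_exp1, expr_exp2)   # first sense of "integrative"
by_factor = combine_types(merged, binding)       # second sense of "integrative"
```

In this sketch the first function lengthens each gene's profile without adding a new kind of evidence, whereas the second relates two different kinds of evidence about the same genes.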
In the rest of the chapter we will describe various techniques proposed to carry out both of these types of integration and will discuss their pros and cons. We will review some of the prominent research following the former approach by Ihmels et al. (2002) and Segal et al. (2005), and work following the latter approach by Bar-Joseph et al. (2003), Tanay et al. (2004, 2005) and Lemmens et al. (2006).
Key Terms in this Chapter
Protein-Protein Interaction: describes the interactions between protein molecules, which are of central importance to virtually every process in a living cell. Since proteins are gene products, these interactions, when studied along with gene expression data, provide a better understanding of the underlying processes.
K-Means Clustering: is an algorithm that groups (clusters) objects, based on certain attributes, into a pre-determined number (K) of groups or clusters. The grouping is done by minimizing the sum of squared distances between the individual data points and the corresponding cluster centre, which is calculated by averaging all the data points within the cluster. It is an iterative procedure that refines the groupings over multiple steps, each of which improves the cluster quality.
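As an illustration of the K-means definition, the sketch below alternates between assigning each point to its nearest centre and recomputing each centre as the mean of its members. It is a minimal pure-Python version; the function names, the random initialisation from the data and the toy expression profiles are assumptions made for this example, not part of any method reviewed in the chapter.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumed initialisation: k data points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the procedure has converged
        assignment = new_assignment
        # Update step: each centre becomes the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centres


def within_cluster_ss(points, assignment, centres):
    """The quantity K-means minimises: total squared distance to the centres."""
    return sum(sq_dist(p, centres[a]) for p, a in zip(points, assignment))


# Hypothetical expression profiles (two conditions per gene).
profiles = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignment, centres = kmeans(profiles, k=2)
```

Because the result depends on the initial centres, it is common practice to run the procedure from several starting points and keep the grouping with the lowest within-cluster sum of squares.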
Chromatin Immunoprecipitation: also popularly known as ChIP, is an experimental method to determine whether proteins (e.g. transcription factors) bind to specific regions of DNA in the cell. When used with microarrays, the technique is known as ChIP-chip, and is used to identify the binding of proteins across the entire genome simultaneously.
Clustering: is the process of organizing objects into groupings (clusters) where members of one group are similar to each other but dissimilar to the objects belonging to other groups. In the field of machine learning it falls under the category of unsupervised learning, because the task is to find structure in unlabelled data.
Bayesian Network: or belief network, is a probabilistic graphical model that represents a set of variables and their probabilistic dependencies. For example, a Bayesian network can be used to calculate the probability of a disease given the expression levels of certain genes. The structure and the probabilistic dependencies among the variables (genes and disease) must be specified, typically from expert knowledge, although they can also be learned from data.
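The disease example in the Bayesian network definition can be sketched with a very small network in which a disease node is the only parent of two gene nodes. The structure, the gene names and every probability below are hypothetical values chosen for illustration; the posterior is obtained by direct enumeration, which is only practical for tiny networks.

```python
# Hypothetical prior probability of the disease.
P_DISEASE = 0.1

# Hypothetical conditional probabilities: P(gene expression is high | disease state).
P_HIGH_GIVEN = {
    "gene_a": {True: 0.8, False: 0.2},
    "gene_b": {True: 0.7, False: 0.3},
}


def posterior_disease(evidence):
    """Return P(disease | evidence), where evidence maps a gene name to
    True (high expression) or False (low expression)."""

    def joint(disease):
        # Joint probability of the disease state and the observed evidence,
        # using the factorisation implied by the assumed network structure.
        p = P_DISEASE if disease else 1.0 - P_DISEASE
        for gene, high in evidence.items():
            p_high = P_HIGH_GIVEN[gene][disease]
            p *= p_high if high else 1.0 - p_high
        return p

    with_disease = joint(True)
    return with_disease / (with_disease + joint(False))


both_high = posterior_disease({"gene_a": True, "gene_b": True})
```

With these assumed numbers, observing high expression of both genes raises the probability of disease from the prior of 0.1 to roughly 0.51.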
Gene Ontology: also commonly referred to as GO, provides a controlled vocabulary (ontology) to describe gene and gene product attributes in various organisms. It has three parts, which describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. It was developed to address the need for consistent descriptions of gene products in different databases (from different or the same organisms).
Microarray: also known as a gene chip, DNA chip, or gene array, is a glass slide carrying a grid pattern of small spots, each of which reacts with a single gene. Microarrays are commonly used for measuring the expression levels of thousands of genes simultaneously, a technique called expression profiling. For example, microarrays can be used to identify disease genes by comparing gene expression in diseased and normal cells.
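The comparison of diseased and normal cells mentioned in the microarray definition is often summarised as a log2 fold change per gene. The sketch below assumes positive, already normalised expression values and an arbitrary threshold of a two-fold change; the gene names, values and function names are hypothetical, and a real analysis would also need replicates and a statistical test.

```python
import math


def log2_fold_changes(diseased, normal):
    """log2(diseased / normal) for every gene measured in both samples."""
    return {
        gene: math.log2(diseased[gene] / normal[gene])
        for gene in diseased
        if gene in normal
    }


def candidate_genes(diseased, normal, threshold=1.0):
    """Genes whose expression differs by at least 2**threshold in either direction."""
    changes = log2_fold_changes(diseased, normal)
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= threshold)


# Hypothetical normalised expression levels.
diseased_cells = {"g1": 8.0, "g2": 2.0, "g3": 1.0}
normal_cells = {"g1": 2.0, "g2": 2.0, "g3": 4.0}
candidates = candidate_genes(diseased_cells, normal_cells)
```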