This chapter describes data integration and data mining techniques in the context of systems biology studies. It argues that the data integration methods now available can make systems biology research considerably easier. Data mining is equally important in this context, so the chapter also discusses some aspects of data mining applied to specific systems biology case studies. The large number of specialized resources available can be difficult to navigate, especially for experimental researchers exploring gene, protein, and pathway data for the first time. Finally, the chapter highlights the complexity of systems biology data and gives an overview of data integration and mining approaches, using the Cell Cycle database and the simulation of cell cycle models as a specific example.
In the application of biomedical science to systems biology, the availability of many different databases and data resources, together with the continuous accumulation of huge amounts of heterogeneous data, has become a crucial issue in the last few years. In the medical sciences, and more particularly in systems biology, it is widely recognized that successful data integration has become essential for exploring the knowledge space in many different biological studies. Through data integration, experimental researchers and computer scientists can discover new and interesting relationships that enable them to make better and faster decisions, for example about disease targets and drug molecules. Moreover, the collection of related information has been shown to be an essential component of biomedical and systems biology research, particularly in the areas of genomics, proteomics, and pathway information.
The need for data integration is widely acknowledged in the bioinformatics and systems biology community, since bioinformatics data are currently spread across the internet and throughout organizations in a wide variety of formats. Moreover, obtaining interesting results in most bioinformatics and systems biology activities, from the functional characterization of genomic and proteomic data to the development of mathematical models of biological processes, requires an integrated view of all the data relevant to those tasks. The challenges of data integration may be addressed using a wide variety of approaches. Each approach has advantages and limits, and it can be difficult to evaluate which one best suits a particular need without fully understanding the data integration landscape. Data integration methods aim to facilitate detailed and accurate investigation of a specific gene, protein or pathway, since high information content is useful both for data mining and for mathematical modelling of the biological process of interest. In this chapter the different data integration approaches and some practical examples of data integration are illustrated for the specific case of the cell cycle process. The importance of the cell cycle in the shift from a healthy to a pathological state under specific experimental conditions is illustrated, together with the resulting need for an integrated system capable of collecting the most important information on cell cycle genes and proteins, drawn from the analysis of the cell cycle information available in the literature and in existing pathway databases.
Another important technique for knowledge discovery is data mining. Data mining systems have become widely used in biomedical science and systems biology because they make it possible to predict the behaviour and future trends of a biological system, allowing knowledge-driven decisions to be taken. In its general definition, data mining can also be considered the process of analyzing data from different perspectives and summarizing it into useful information, which can be used to increase current knowledge about a specific biological process. Technically, data mining is the process of finding correlations or patterns among many fields in large relational databases. An example of a data mining application in systems biology, in the context of the mathematical modelling of a biological process, is illustrated in this chapter.
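As a minimal sketch of what finding correlations among fields can look like, the Python fragment below computes Pearson correlation coefficients between the columns of a small table and reports the strongly correlated pairs. The column names, the values and the 0.9 threshold are hypothetical and chosen only for illustration; they are not taken from the Cell Cycle database or from any real experiment.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name1, name2, r) for every column pair with |r| >= threshold."""
    hits = []
    for (name1, values1), (name2, values2) in combinations(columns.items(), 2):
        r = pearson(values1, values2)
        if abs(r) >= threshold:
            hits.append((name1, name2, round(r, 3)))
    return hits


# Hypothetical measurements: one column per field, one entry per sample.
table = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "gene_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}

pairs = correlated_pairs(table)
for name1, name2, r in pairs:
    print(f"{name1} ~ {name2}: r = {r}")
```

In a real setting the columns would come from a database query, and a correlation found this way would only be a candidate relationship to be checked against biological knowledge.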
Moreover, the use of bioinformatic tools, data mining and data integration can help researchers to better study modelling complexity by screening potential model components in order to find the emergent properties of a biological system, which is one of the main aims of systems biology studies. Finally, the main advantages of using data mining and data integration approaches in systems biology investigations are presented.
Key Terms in this Chapter
Cell Cycle: The series of events that take place in a eukaryotic cell leading to its replication. These events can be divided into two broad periods: interphase, during which the cell grows, accumulating nutrients needed for mitosis and duplicating its DNA, and the mitotic or M phase, during which the cell splits itself into two distinct cells, often called daughter cells. The cell cycle is a crucial process in an organism's life: it is the process by which a single-celled fertilized egg develops into a mature organism, as well as the process by which hair, skin, blood cells, and some internal organs are renewed.
Data Mining: The process through which large amounts of data are sorted in order to extract relevant information. The term is increasingly used in the sciences for the extraction of information from the enormous data sets generated by modern experimental and observational methods, especially in the biological context. It can be defined as the nontrivial extraction of previously unknown and potentially useful information from data and databases.
Data Warehouse: The main repository of an organization’s historical data, its corporate memory. It contains the raw material for management’s decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Mathematical Model: An abstract model that uses mathematical language to describe the behaviour of a system. Mathematical models are used particularly in the natural sciences and engineering disciplines (such as physics, biology, and electrical engineering) but also in the social sciences (such as economics, sociology and political science). A mathematical model can be defined as the representation of the essential aspects of an existing system (or a system to be constructed) which presents knowledge of that system in usable form.
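To make the definition concrete, the following Python sketch simulates a very small mathematical model: a single species X that is synthesized at a constant rate and degraded in proportion to its amount, dX/dt = k_syn - k_deg * X, integrated with the explicit Euler method. This is an illustrative assumption and not one of the cell cycle models discussed in the chapter; the parameter names and values are hypothetical.

```python
def simulate(k_syn, k_deg, x0=0.0, dt=0.01, steps=2000):
    """Integrate dX/dt = k_syn - k_deg * X with the explicit Euler method.

    Returns the list of X values, starting with x0, one per time step.
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * (k_syn - k_deg * x)
        trajectory.append(x)
    return trajectory


# Hypothetical rate constants; the analytical steady state is k_syn / k_deg.
trajectory = simulate(k_syn=2.0, k_deg=0.5)
print(f"final amount of X: {trajectory[-1]:.4f}")
```

With these values the simulated amount of X approaches the steady state k_syn / k_deg = 4. Realistic models of biological processes involve many coupled equations and are usually solved with dedicated numerical libraries instead of a hand-written Euler loop.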
Data Integration: The process of combining data existing in different sources and providing the user with a unified view of these data. This process is useful in many situations, in particular in the scientific environment when the need arises to combine research results from different bioinformatics repositories. Data integration is required with increasing frequency as the volume of existing data and the need to share it grow.
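The idea of a unified view can be sketched in a few lines of Python. The example below merges records from two sources on a shared key, so that each gene has a single combined entry. The source contents, the field names (`symbol`, `gene_id`, `pathway`) and the identifiers are hypothetical placeholders, not the schema of any real repository.

```python
from collections import defaultdict

# Hypothetical records from two independent sources (placeholder values).
gene_source = [
    {"symbol": "CDK1", "gene_id": "G001"},
    {"symbol": "CCNB1", "gene_id": "G002"},
]
pathway_source = [
    {"symbol": "CDK1", "pathway": "cell cycle"},
    {"symbol": "TP53", "pathway": "cell cycle"},
]


def integrate(*sources, key="symbol"):
    """Merge records from several sources into one entry per key value.

    Fields from later sources overwrite same-named fields from earlier ones.
    """
    unified = defaultdict(dict)
    for records in sources:
        for record in records:
            unified[record[key]].update(record)
    return dict(unified)


unified_view = integrate(gene_source, pathway_source)
for symbol, entry in sorted(unified_view.items()):
    print(symbol, entry)
```

Entries found in only one source are kept with the fields that are available. A real integration system also has to reconcile differing identifiers, formats and conflicting values, which this sketch deliberately leaves out.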