Technological advances in high-throughput techniques and efficient data-gathering methods, coupled with computational biology efforts, have produced a vast amount of life science data, often held in distributed and heterogeneous repositories. These repositories contain information such as sequence and structure data, annotations for biological data, results of complex computations, genetic sequences and multiple bio-datasets. The heterogeneity of these data has created a need for research into resource integration and platform-independent processing of investigative queries that involve heterogeneous data sources. When processing huge amounts of data, information integration is one of the most critical issues, because the intrinsic semantics of all the merged data sources must be preserved. Such integration would allow the data to be organized properly, fostering analysis of and access to the information needed for critical tasks, such as processing micro-array data to study protein function and supporting medical researchers in detailed studies of protein structures to facilitate drug design (Ignacimuthu, 2005). Furthermore, the DNA micro-array research community urgently requires technology that allows up-to-date micro-array data to be found, accessed and delivered within a secure framework (Sinnot, 2007).
Research disciplines such as Bioinformatics, where information integration is critical, could benefit from harnessing the potential of a new approach: the Semantic Web (SW). The term was coined by Berners-Lee, Hendler and Lassila (2001) to describe the evolution from a Web consisting largely of documents for humans to read towards a new paradigm that also includes data and information for computers to manipulate. The SW adds machine-understandable and machine-processable metadata to Web resources through its key enabling technology: ontologies (Fensel, 2002).
An ontology is a formal, explicit and shared specification of a conceptualization. The SW was conceived as a way to meet the need for data integration on the Web. This article presents SAMIDI, a Semantics-based Architecture for Micro-array Information and Data Integration. Its most notable innovation is the use of semantics as a tool for reconciling different vocabularies and terminologies and thereby fostering integration. SAMIDI comprises two parts: a methodology for unifying heterogeneous data sources, based on an analysis of the requirements of the unified data set, and a software architecture.
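The idea of using shared semantics to reconcile different vocabularies can be illustrated with a minimal sketch. It is not SAMIDI's implementation: the source names, field names and concept names below are hypothetical, and a plain dictionary stands in for what would in practice be an ontology. Each source-specific field is mapped to an agreed concept, so that records from heterogeneous sources end up in one common representation.

```python
from typing import Dict, List

# Hypothetical term mappings: source-specific field name -> shared concept.
# These names are illustrative assumptions, not taken from SAMIDI, MGED or MAGE.
TERM_MAPPINGS: Dict[str, Dict[str, str]] = {
    "source_a": {
        "gene": "gene_symbol",
        "expr": "expression_level",
        "sample": "sample_id",
    },
    "source_b": {
        "GeneName": "gene_symbol",
        "Intensity": "expression_level",
        "SampleRef": "sample_id",
    },
}


def to_unified(source: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename the fields of one record to the shared concepts."""
    mapping = TERM_MAPPINGS[source]
    unified: Dict[str, object] = {"source": source}
    for field, value in record.items():
        if field not in mapping:
            # Fail loudly rather than silently dropping unmapped semantics.
            raise KeyError(f"no shared concept for {source}.{field}")
        unified[mapping[field]] = value
    return unified


def integrate(sources: Dict[str, List[Dict[str, object]]]) -> List[Dict[str, object]]:
    """Merge records from all sources into one list in the shared vocabulary."""
    return [
        to_unified(name, record)
        for name, records in sources.items()
        for record in records
    ]


if __name__ == "__main__":
    merged = integrate({
        "source_a": [{"gene": "TP53", "expr": 2.5, "sample": "s1"}],
        "source_b": [{"GeneName": "BRCA1", "Intensity": 1.1, "SampleRef": "s2"}],
    })
    for row in merged:
        print(row)
```

Rejecting unmapped fields, instead of passing them through, is a deliberate choice in this sketch: it reflects the requirement that the semantics of every merged source be preserved rather than lost during integration.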
This section introduces Bioinformatics and its need to process massive amounts of data, the benefits of integrating existing sources of biological information, and semantics as a tool for that integration.
The term Bioinformatics was coined by Hwa Lim in the late 1980s and later popularized through its association with the Human Genome Project (Goodman, 2002). Bioinformatics is the application of information science and technologies to the management of biological data (Denn & MacMullen, 2002), and it describes any use of computers to store, compare, retrieve, analyze or predict the composition of the structure of biomolecules (Segall & Zhang, 2006). Research in biology relies on Bioinformatics to manipulate data and discover new biological knowledge at several levels of increasing complexity. Because biological data are produced through high-throughput methods (Vyas & Summers, 2005), they have to be represented and stored in different formats, such as micro-arrays.
Key Terms in this Chapter
DNA Micro-Array: Collection of microscopic DNA spots attached to a solid surface forming an array for the purpose of expression profiling, which monitors expression levels for thousands of genes simultaneously.
MGED: Ontology for micro-array experiments that establishes concepts, definitions, terms and resources for standardized description of a micro-array experiment in support of MAGE.
MIAME: Minimum Information About a Micro-array Experiment, an XML-based standard for the description of micro-array experiments.
Ontology: The specification of a conceptualization of a knowledge domain. It is a controlled vocabulary that formally describes objects and the relations among them, and it has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest.
Semantic Information Model Methodology: Set of activities, together with their inputs and outputs, aimed at the transformation of a collection of micro-array data sources into a semantically integrated and unified representation of the information stored in the data sources.
Unifying Information Model (UIM): Construction that brings together all the physical data schemas of the data sources to be integrated. It is built to represent an agreed-upon scientific view and vocabulary, which serves as the foundation for understanding the data.
MAGE: Standard micro-array data model and exchange format that is able to capture information specified by MIAME.
Bioinformatics: Application of information science and technologies for the management of biological data and the use of computers to store, compare, retrieve, analyze or predict the composition of the structure of biomolecules.