Data Integration Issues and Opportunities in Biological XML Data Management


Marco Mesiti (DICO, Università di Milano, Italy), Ernesto Jiménez Ruiz (DLSI, Universitat Jaume I, Spain), Ismael Sanz (Universitat Jaume I, Spain), Rafael Berlanga Llavori (DLSI, Universitat Jaume I, Spain), Giorgio Valentini (Università di Milano, Italy), Paolo Perlasca (Università di Milano, Italy) and David Manset (maatG, France)
DOI: 10.4018/978-1-60566-308-1.ch012


A growing number of research and industrial organizations produce huge amounts of biological data from experimentation with biological systems. To make these heterogeneous data sources easy to use, several data integration efforts, based mainly on XML, are currently under way. Starting from a discussion of the main biological data types and system interactions that need to be represented, the authors examine the main approaches proposed for modelling them in XML. They then present current efforts in biological data integration and show how an increasing amount of semantic information is required, in the form of controlled vocabularies and ontologies. Finally, future research directions in biological data integration are discussed.
Chapter Preview


Bioinformatics is the science of storing, extracting, organizing, analyzing, interpreting, and utilizing information from biological sequences and molecules. It has been fuelled mainly by advances in DNA sequencing and genome mapping techniques, which create great opportunities for developing novel data analysis methods. Major challenges in bioinformatics include protein structure prediction, homology search, multiple alignment and phylogeny construction, genomic sequence analysis and gene finding, as well as applications such as gene expression data analysis and drug discovery in the pharmaceutical industry. Nowadays, a growing number of research institutions produce huge amounts of biological data derived from experimentation with biological systems. These data sources can be fully exploited only if a great effort is made to integrate disparate data formats, protocols and tools. Work on data integration and system interoperability is currently being undertaken to overcome the high level of heterogeneity in the available resources.

One way to expand the utility and interpretability of the individual resources would be to create a standard unified model for the description of data and, consequently, a machine-readable format for their exchange and representation. In the literature, we can find several data formats intended to represent biological entities and systems, which fall into three groups: non-XML-based, XML-based and ontology-based formats. FASTA (Pearson, 1994) is an example of a non-XML-based data format for the representation of sequence data. The main problem with this type of format is the lack of a consistent structure, so that the same correct file may be interpreted differently. The second group tries to overcome this problem by using XML as the data format. Within this group, two approaches are distinguished depending on how the structure is validated, that is, whether they use XML document type definitions (DTDs) or XML schema definitions (XSDs). XSDs can be richer than DTDs, since users can specify not only the structure but also the semantics of the XML tags by defining conditions and constraints. However, only a few proposals exploit the potential of XSDs. Finally, ontology-based formats have emerged as a solution to the lack of semantics and allow the formal representation of the knowledge to be exchanged. The Web Ontology Language (OWL) and the Open Biomedical Ontologies (OBO) format are the main languages used to represent ontologies. Unlike XML Schemas, well-defined ontologies guarantee the correct representation of the content semantics. Note that the ontology-based group also includes the XML formats that link the content with ontologies or controlled vocabularies.
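
To make the contrast between a loosely structured format and an explicitly structured one concrete, the following Python fragment is a minimal sketch that reads FASTA-style text and re-expresses it as XML. It is not taken from the chapter: the element names (`sequences`, `sequence`, `description`, `residues`) are hypothetical and do not belong to any of the standards discussed here, and the rule that the first whitespace-delimited token of the header is the identifier is an assumption, which is precisely the kind of convention that FASTA itself does not enforce.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_xml(text):
    """Build an XML tree with hypothetical element names from FASTA text."""
    root = ET.Element("sequences")
    for header, residues in parse_fasta(text):
        # Assumption: the first token is the identifier, the rest is free text.
        identifier, _, description = header.partition(" ")
        entry = ET.SubElement(root, "sequence", id=identifier)
        ET.SubElement(entry, "description").text = description
        ET.SubElement(entry, "residues").text = residues
    return root


EXAMPLE = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2
ACGT
"""

xml_root = fasta_to_xml(EXAMPLE)
print(ET.tostring(xml_root, encoding="unicode"))
```

Once the identifier, description and residues are separate elements, a DTD or XSD could be written to validate them, whereas the FASTA header leaves their boundaries to the reader's interpretation.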

It is worth mentioning that some of the previous efforts have also defined specific XML-like languages (e.g. SBML, BioPAX, PSI-MI) for the representation of biological data. The aim of these efforts is the creation of public and well-known standards, so that data source providers are able to format and externalize their biological data according to the schemas and restrictions given by the standard. This is an attractive approach, since data source providers know how to format the data to be externalized and shared, and these data can be easily used and integrated in applications such as Taverna or Pegasys to compose complex workflows. However, as will be discussed in this chapter, further efforts are still necessary in several directions.

Another important open issue is the selection of relevant information from XML files. Because biological data come from heterogeneous sources, extracting the interesting portions of XML files requires approximate retrieval systems specifically tailored for bioinformatics. Similarity measures that can be adapted to the context appear to be a promising research direction.
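
As a rough illustration of what approximate retrieval over heterogeneous XML might look like, the sketch below ranks elements by how closely their tag names resemble a requested name. It is only an assumption-laden example, not the approach of the chapter: the generic string similarity from Python's `difflib` stands in for the domain-specific, context-dependent measures the text has in mind, and the function name `approximate_find`, the sample document and the threshold of 0.7 are all hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET


def approximate_find(root, wanted_tag, threshold=0.7):
    """Return (score, element) pairs whose tag resembles wanted_tag, best first."""
    hits = []
    for elem in root.iter():
        # Case-insensitive string similarity in [0, 1]; a stand-in measure.
        score = difflib.SequenceMatcher(
            None, elem.tag.lower(), wanted_tag.lower()
        ).ratio()
        if score >= threshold:
            hits.append((score, elem))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


# Two hypothetical sources that name the same concept differently.
DOCUMENT = """
<sources>
  <source>
    <protein_name>Hemoglobin subunit alpha</protein_name>
    <organism>Homo sapiens</organism>
  </source>
  <source>
    <proteinName>Hemoglobin subunit beta</proteinName>
  </source>
</sources>
"""

hits = approximate_find(ET.fromstring(DOCUMENT), "proteinname")
for score, elem in hits:
    print(f"{score:.2f} {elem.tag}: {elem.text}")
```

An exact tag lookup would return only one of the two protein elements, while the threshold lets both through and still excludes unrelated tags. Making the measure and the threshold parameters is one simple way to let the similarity be adapted to the context.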

The rest of the chapter is organized as follows. The first part describes the biological domain, covering the main structures for entities and interactions, and presents the main XML-like exchange standards. The next section surveys some XML-based systems for biological data integration and discusses their main limitations. Finally, the last part of the chapter presents new and necessary directions and trends in biological data integration.

Complete Chapter List

Table of Contents
Ernesto Damiani
Eric Pardede
Eric Pardede
Chapter 1
Closing the Gap Between XML and Relational Database Technologies: State-of-the-Practice, State-of-the-Art and Future Directions
Mary Ann Malloy, Irena Mlynkova
As XML technologies have become a standard for data representation, it is inevitable to propose and implement efficient techniques for managing XML...

Chapter 2
Challenges on Modeling Hybrid XML-Relational Databases
Mirella M. Moro, Lipyeow Lim, Yuan-Chi Chang
It is well known that XML has been widely adopted for its flexible and self-describing nature. However, relational data will continue to co-exist...

Chapter 3
XML and LDAP Integration: Issues and Trends
Vassiliki Koutsonikola, Athena Vakali
Nowadays, XML has become the standard for representing and exchanging data over the Web and several approaches have been proposed for efficiently...

Chapter 4
XML Schema Evolution and Versioning: Current Approaches and Future Trends
Giovanna Guerrini, Marco Mesiti
The large dynamicity of XML documents on the Web has created the need to adequately support structural changes and to account for the possibility of...

Chapter 5
XML Stream Query Processing: Current Technologies and Open Challenges
Mingzhu Wei, Ming Li, Elke A. Rundensteiner, Murali Mani, Hong Su
Stream applications bring the challenge of efficiently processing queries on sequentially accessible XML data streams. In this chapter, the authors...

Chapter 6
XSLT: Common Issues with XQuery and Special Issues of XSLT
Sven Groppe, Jinghua Groppe, Christoph Reinke, Nils Hoeller, Volker Linnemann
The widespread usage of XML in the last few years has resulted in the development of a number of XML query languages like XSLT or the later...

Chapter 7
Recent Advances and Challenges in XML Document Routing
Mirella M. Moro, Zografoula Vagena, Vassilis J. Tsotras
Content-based routing is a form of data delivery whereby the flow of messages is driven by their content rather than the IP address of their...

Chapter 8
Native XML Programming: Make Your Tags Active
Philippe Poulard
XML engines are usually designed to solve a single class of problems: transformations of XML structures, validations of XML instances, Web...

Chapter 9
Continuous and Progressive XML Query Processing and its Applications
Stéphane Bressan, Wee Hyong Tok, Xue Zhao
Since XML technologies have become a standard for data representation, a great amount of discussion has been generated by the persisting open issues...

Chapter 10
Issues in Personalized Access to Multi-Version XML Documents
Fabio Grandi, Federica Mandreoli, Riccardo Martoglia
In several application fields including legal and medical domains, XML documents are “versioned” along different dimensions of interest, whose...

Chapter 11
Security Issues in Outsourced XML Databases
Tran Khanh Dang
In an outsourced XML database service model, organizations rely upon the premises of external service providers for the storage and retrieval...

Chapter 12
Data Integration Issues and Opportunities in Biological XML Data Management
Marco Mesiti, Ernesto Jiménez Ruiz, Ismael Sanz, Rafael Berlanga Llavori, Giorgio Valentini, Paolo Perlasca, David Manset
There is a proliferation of research and industrial organizations that produce sources of huge amounts of biological data issuing from...

Chapter 13
Modeling XML Warehouses for Complex Data: The New Issues
Doulkifli Boukraa, Riadh Ben Messaoud, Omar Boussaid
Current data warehouses deal for the most part with numerical data. However, decision makers need to analyze data presented in all formats which one...

Chapter 14
XML Benchmarking: The State of the Art and Possible Enhancements
Irena Mlynkova
Since XML technologies have become a standard for data representation, numerous methods for processing XML data emerge every day. Consequently, it...

About the Contributors