Extraction and Prediction of Biomedical Database Identifier Using Neural Networks towards Data Network Construction

Hendrik Mehlhorn (Institute of Plant Genetics and Crop Plant Research, Germany), Matthias Lange (Institute of Plant Genetics and Crop Plant Research, Germany), Uwe Scholz (Institute of Plant Genetics and Crop Plant Research, Germany) and Falk Schreiber (Institute of Plant Genetics and Crop Plant Research, Germany & Martin Luther University, Germany)
Copyright: © 2013 | Pages: 26
DOI: 10.4018/978-1-4666-2827-4.ch004


In this work, we investigate to what extent an automated construction of an integrated data network is possible. We propose a method that predicts and extracts cross-references from multiple life science databases and possible referenced data targets. We study the retrieval quality of our method and report on first, promising results.
Chapter Preview

1. Introduction

Bioinformatics is the field of science in which biology, computer science, and in particular information retrieval merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights. The first step in this direction has already been taken: high-throughput biotechnologies such as next-generation sequencing, proteomics, and metabolomics produce massive amounts of data (Galperin & Fernandez-Suarez, 2012). However, the data gathered in biology and medicine is as manifold as the biological research areas themselves. If we narrow the complex field of biomedical research down to molecular biology for the purposes of this chapter, bioinformatics attempts to model and interpret data along the following pathway: genome, gene sequence, protein sequence, protein structure, protein function, cellular pathways and networks, and biomedical literature. The first consequence of this revolution is the explosion of available data that biomolecular researchers have to harness and exploit (Roos, 2001); e.g., as of March 2012, GenBank provides access to 150,000,000 DNA sequences and PubMed lists 2,400,000 research articles. The number of publicly available databases has now passed the high-water mark of 1,200 (Galperin & Fernandez-Suarez, 2012).

The big players in this context are, on the one hand, companies such as pharmaceutical firms or plant breeders and, on the other hand, publicly or privately financed research institutes. Each acts as either a data consumer or a data producer. In consequence, there is a rising need to find, extract, merge, and synthesize information from multiple, disparate sources. The convergence of biology, computer science, and information technology will accelerate this multidisciplinary endeavor. The basic needs are formulated by Lacroix and Critchlow (2003):

  1. On-demand access to and retrieval of the most up-to-date biological data, and the ability to perform complex queries across multiple heterogeneous databases to find the most relevant information.

  2. Access to best-of-breed analytical tools and algorithms for extracting useful information from the massive volume and diversity of biological data.

  3. A robust information integration infrastructure that connects various computational steps involving database queries, computational algorithms, and application software.

Database integration therefore plays an important role in this context, and we will briefly introduce the most popular concepts for database integration in the life sciences. Using the World Wide Web and social networks as inspiring examples, the basic idea presented in this chapter is to compute a network of biomedical knowledge by taking a set of database entries as input, analyzing the entries and their attributes, and identifying potential cross-references within the same database and to other databases. We propose IDPredictor, an algorithm that predicts cross-references from multiple life science databases and thus lays the basis for enhanced information retrieval over biomedical data. We discuss to what extent IDPredictor can be used as a method for the efficient and precise prediction of database cross-references.
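
To make the input and output of such a network construction concrete, the following minimal Python sketch scans the attributes of database entries for strings that look like identifiers and records each hit as a candidate edge. It is not IDPredictor, which relies on neural networks. The hand-written patterns stand in for the learned predictor, and the pattern table, the function name, and the entry layout are assumptions made purely for illustration.

```python
import re

# Hypothetical, simplified identifier patterns (illustration only).
# A learned predictor would replace this hand-written table.
ID_PATTERNS = {
    "uniprot": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
    "go": re.compile(r"\bGO:\d{7}\b"),
    "pubmed": re.compile(r"\bPMID:\d+\b"),
}


def candidate_cross_references(entries):
    """Return candidate edges (source entry, target database, identifier).

    `entries` maps an entry ID to a dict of attribute name -> free text.
    This layout is an assumption, not a format defined in the chapter.
    """
    edges = set()
    for entry_id, attributes in entries.items():
        for text in attributes.values():
            for database, pattern in ID_PATTERNS.items():
                for match in pattern.finditer(text):
                    edges.add((entry_id, database, match.group(0)))
    return edges
```

Each returned triple can be read as one edge of the data network: a source entry, the database it presumably points to, and the identifier found in its attributes. A pattern table of this kind has to be maintained by hand for every target database, which is the effort an automated prediction method aims to avoid.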

In Section 1 we give a brief introduction to data management in the life sciences. In particular, we discuss approaches to data integration, information retrieval, and aspects of data identifiers. In Section 2 we present the machine learning methods underlying IDPredictor. In Section 3 we discuss training methods and prediction performance measures. In Section 4 we discuss the prediction performance, preliminary results, and the application to database networks.
