BioTextRetriever: A Tool to Retrieve Relevant Papers

Célia Talma Gonçalves (Instituto Superior de Contabilidade e Administração do Porto & CEISE-STI, Portugal), Rui Camacho (Universidade do Porto, Portugal) and Eugénio Oliveira (Universidade do Porto, Portugal)
Copyright: © 2011 | Pages: 16
DOI: 10.4018/jkdb.2011070102

Abstract

Whenever new DNA or protein sequences are decoded, it is almost compulsory to look at similar sequences and at the papers describing them, both to collect relevant information about the function and activity of the new sequences and to learn what is already known about similar sequences. Current web sites and databases of sequences usually link a set of curated paper references to each sequence. Those links are a good starting point when looking for relevant information related to a set of sequences. One way to implement such an approach is to run BLAST on the newly decoded sequences, collect similar sequences, and then look at the papers linked to those similar sequences. Most often the number of retrieved papers is small and one has to search large databases for relevant papers. This paper proposes a process for generating a classifier based on the initial set of relevant papers. First, the authors collect similar sequences using an alignment algorithm such as BLAST. Then, the authors use the enlarged set of papers to construct a classifier. Finally, the automatically constructed classifier is used to search MEDLINE and further enlarge the set of relevant papers.

An Architecture For An Information Retrieval System

The overall goal of our work is to implement a web-based search tool that receives a set of genomic or proteomic sequences and returns an ordered set of papers relevant to the study of such sequences. The initial set of sequences is supplied by a biologist together with a set of relevant keywords and an e-value (a statistic that estimates the significance of a match between two sequences). These three items are the input for BioTextRetriever (Gonçalves, Camacho, & Oliveira, 2011), as can be seen in Figure 1, which summarizes the approach that we now describe in detail. In the following description we use NCBI as the sequence database.
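For intuition about the e-value input, the following minimal sketch uses the standard BLAST relation between a normalized bit score S' and the expected number of chance matches, E = m · n · 2^(-S'), where m is the query length and n the database length. The function name and the example lengths are illustrative assumptions; the text above does not say how BioTextRetriever handles e-values beyond supplying them to BLAST, and real BLAST runs use effective (edge-corrected) lengths rather than raw ones.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Return E = m * n * 2**(-S') for bit score S'.

    query_len (m) and db_len (n) are raw lengths here; this is a
    simplification, since BLAST itself uses effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical lengths chosen so the arithmetic is easy to follow:
    # 1024 * 1024 = 2**20, so a bit score of 20 gives E = 1.0.
    print(expected_chance_hits(20, 1024, 1024))
    # Each extra bit of score halves the expected number of chance hits.
    print(expected_chance_hits(21, 1024, 1024))
```

A smaller e-value therefore means a match is less likely to have arisen by chance, which is why it is a natural cut-off for deciding which similar sequences to keep.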

Figure 1. Sequence of steps executed by BioTextRetriever when the user provides a set of initial DNA/protein sequences

In Step 1, the user (a biologist researcher) provides an initial set of sequences, optionally a list of keywords, and an e-value. With these three items (sequences, keywords and e-value) and using the NCBI BLAST tool, we collect a set of similar sequences together with the paper references associated with them. We could also run the same inputs against Ensembl, which may return a different set of paper references. However, for the proposed work we have used only the NCBI database.
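The collection part of Step 1 can be sketched as follows, assuming the BLAST search has already been run and its hits parsed into simple records. The `Hit` record, its field names, and `collect_papers` are hypothetical names introduced for illustration, not the authors' implementation. Keyword handling is left out because the text above does not say how the keywords are applied.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Hit:
    """One similar sequence returned by an alignment search (hypothetical record)."""
    accession: str
    e_value: float
    paper_ids: Tuple[str, ...] = ()  # references linked to this sequence


def collect_papers(hits: Iterable[Hit], max_e_value: float) -> List[str]:
    """Return the paper ids linked to hits at or below the e-value cut-off.

    Ids are de-duplicated and kept in first-seen order.
    """
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for hit in hits:
        if hit.e_value <= max_e_value:
            for paper_id in hit.paper_ids:
                seen.setdefault(paper_id, None)
    return list(seen)


if __name__ == "__main__":
    # Made-up hits and paper ids, for illustration only.
    hits = [
        Hit("SEQ_A", 1e-30, ("paper1", "paper2")),
        Hit("SEQ_B", 0.5, ("paper3",)),          # above the cut-off, dropped
        Hit("SEQ_C", 1e-10, ("paper2", "paper4")),
    ]
    print(collect_papers(hits, max_e_value=1e-5))
```

The resulting list plays the role of the initial set of paper references gathered from the similar sequences.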
