Background
The accumulation of biological information has accelerated rapidly in recent years, owing largely to progress in high-throughput biology. As the flood of sequence data grows ever larger, and the number of publications describing the data continues to surge, biomedical scientists face information overload. Simply keeping pace is now a major challenge, and many researchers struggle to remain ‘expert’ even in their own highly specific fields (Howe et al., 2008; Fraser & Dunstan, 2010). This problem is particularly acute for curators of bio-databases such as UniProt (The UniProt Consortium, 2011) and InterPro (Hunter et al., 2009), the largest extensively annotated protein sequence and protein family archives publicly available. Here, curators’ work depends on being able to find pertinent information relating to biological entities so that they can both validate and annotate new database entries, and ensure that information associated with existing entries stays up to date.
This situation has spurred a number of computational approaches that try to extract information automatically from the biomedical literature. Some of these have treated the problem as a traditional text-mining task, producing tools to address information retrieval (e.g., ReleMed (Siadaty et al., 2007), PubFocus (Plikus et al., 2006) and MedBlast (Tu et al., 2004)) and/or entity recognition as a prelude to information extraction (e.g., iHOP (Hoffmann & Valencia, 2004), Whatizit (Rebholz-Schuhmann et al., 2008), Textpresso (Muller et al., 2004) and EBIMed (Rebholz-Schuhmann et al., 2007)). Other systems, such as BioRat (Corney et al., 2004) and ASSERT (Ananiadou et al., 2009), have adopted more ‘synthetic’ approaches, endeavouring to provide summaries that obviate the need for users to read entire articles, or sets of articles. The intractability of natural language to computational analysis is well known: the hurdles it poses to algorithmic attack are simply enormous. Initiatives like these are therefore extremely valuable, and the tools that have been produced represent important steps along the path towards the fully automatic annotation processes of the future. However, each of these tools has tended to focus on one part of a biocurator’s typical annotation ‘workflow’; to reach useful or informative sentences within relevant literature, curators therefore routinely have to ‘mix and match’, using a number of different tools to address particular tasks within their annotation pipelines. Faced with the growing quantities of data yielded by today’s high-throughput biology, this can be a disjointed, relentlessly time-consuming and often demoralising process.
Other, ‘real-life’ systems have been developed for more general audiences, and hence tend to summarise the rather more general results and/or news feeds from Internet search engines: e.g., Shablast, which uses Microsoft’s Bing search engine; or the Ultimate Research Reporter, which combines Internet search with text-mining techniques, concept extraction, text summarisation, tag clouds, etc., to provide an ‘easy to understand research report’. While handy organisers for generic search results, providing overarching definitions and ordered lists of links to relevant sites and documents, such tools are of limited value for biomedical researchers. Specifically, they do not harness information directly from the primary sequence archives, nor from the biomedical literature cited within those archives, nor from the wider scientific literature relevant to entries within them.