MINOTAUR: A Web-Based Annotator-Assistant Tool

Alex L. Mitchell (University of Manchester and European Bioinformatics Institute, UK), Ioannis Selimas (University of Manchester, UK) and Teresa K. Attwood (University of Manchester, UK)
DOI: 10.4018/ijsbbt.2012010101


In recent years, the rapid accumulation of biological data and the corresponding enormous growth in the number of published research papers have rendered data annotation and literature searching immensely laborious tasks. The problems are particularly acute for biocurators, who often need to move quickly and easily, say, from a set of biomolecular sequences, to relevant literature search terms, and thence to a core set of informative sentences that might help their annotation efforts. To this end, the authors have developed MINOTAUR, a Web-based assistant tool that helps biologists and biocurators to find relevant facts in online databases and biomedical literature abstracts. The software takes a variety of inputs (sequences, keywords, etc.), ranks returned documents according to relevance, and then extracts pertinent sentences using machine-learning and rule-based sentence-classification systems. These may be collated and presented as blocks of text to seed manual-annotation processes. The suite is available for interactive use at http://www.bioinf.manchester.ac.uk/dbbrowser/minotaur/about.html

The accumulation of biological information has accelerated rapidly in recent years, owing largely to progress in high-throughput biology. As the flood of sequence data grows ever larger, and the number of publications describing the data continues to surge, biomedical scientists face information overload. Simply keeping pace is now a major challenge, and many researchers struggle to remain ‘expert’ even in their own highly specific fields (Howe et al., 2008; Fraser & Dunstan, 2010). This problem is particularly acute for curators of bio-databases such as UniProt (The UniProt Consortium, 2011) and InterPro (Hunter et al., 2009), the largest extensively annotated protein sequence and protein family archives publicly available. Here, curators’ work depends on being able to find pertinent information relating to biological entities so that they can both validate and annotate new database entries, and ensure that information associated with existing entries stays up-to-date.

This situation has spurred a number of computational approaches to try to extract information automatically from the biomedical literature. Some of these have approached the problem as a traditional text-mining task, producing tools to address information retrieval (e.g., ReleMed (Siadaty et al., 2007), PubFocus (Plikus et al., 2006) and MedBlast (Tu et al., 2004)) and/or entity recognition as a prelude to information extraction (e.g., iHOP (Hoffmann & Valencia, 2004), Whatizit (Rebholz-Schuhmann et al., 2008), Textpresso (Muller et al., 2004) and EBIMed (Rebholz-Schuhmann et al., 2007)). Other systems, such as BioRat (Corney et al., 2004) and ASSERT (Ananiadou et al., 2009), have adopted more ‘synthetic’ approaches, endeavouring to provide summaries that obviate the need for users to read entire articles, or sets of articles. The intractability of natural language to computational analysis is well known: the hurdles it poses to algorithmic attack are enormous. Initiatives like these are therefore extremely valuable, and the tools that have been produced represent important steps forward along the path to the fully automatic annotation processes of the future. However, the primary focus of each of these pieces of software has tended to be on one part of a biocurator’s typical annotation ‘workflow’; to get to useful or informative sentences within relevant literature, therefore, curators routinely have to ‘mix and match’, using a number of different tools to address particular tasks within their annotation pipelines. Faced with the growing quantities of data yielded by today’s high-throughput biology, this can be a disjointed, relentlessly time-consuming and often demoralising process.

Other, ‘real life’, systems have been developed for more general audiences, and hence tend to summarise the rather more general results and/or news feeds from Internet search engines: e.g., Shablast, which uses Microsoft's Bing search engine; or the Ultimate Research Reporter, which combines Internet search with text-mining techniques, concept extraction, text summarisation, tag clouds, etc., to provide an ‘easy to understand research report’. While handy organisers for generic search results, providing overarching definitions and ordered lists of links to relevant sites and documents, such tools are of limited value for biomedical researchers. Specifically, they do not harness information directly from the primary sequence archives, nor from the biomedical literature cited within those archives, nor from the wider scientific literature relevant to entries within them.
