This chapter discusses the use of terminological resources for Information Retrieval (IR) in the biomedical domain. The authors first introduce a number of example resources that can be used to compile terminologies for biomedical IR and explain some common problems with such resources for text mining purposes, including redundancy, term ambiguity, insufficient coverage of concepts, and incomplete semantic organization. They also discuss techniques used to address each of these deficiencies, such as static polysemy detection and the addition of terms and linguistic annotation from running text. In the second part of the chapter, the authors show how query expansion with synonyms of the original query terms, derived from terminological resources, can increase the recall of IR systems. Special care is needed to prevent query drift caused by the added terms, and high-quality word sense disambiguation algorithms can be used to make query expansion more conservative. In addition, they present solutions that help focus on the user's specific information need by navigating and rearranging the retrieved documents. Finally, they explain the advantages of applying terminological and semantic resources at indexing time. The authors argue that a semantic index, in which terms are disambiguated for their semantic types and larger chunks of text denote entities and the relations between them, can facilitate query expansion, reduce the need for query refinement, and increase the overall performance of Information Retrieval. Semantic indexing also supports generic queries for concept categories, such as genes or diseases, rather than single keywords.
Compilation of Lexical Resources
A number of life science data resources lending support to text mining solutions are available, although they differ in quality, coverage and suitability for IR solutions. In the following section we provide an outline of the databases commonly used to aid Information Retrieval and Information Extraction.
Public Resources for Biomedical and Chemical Terminologies
The Unified Medical Language System (UMLS) is a commonly used terminological resource provided by the National Library of Medicine. UMLS is a compilation of several terminologies and contains terms denoting diseases, syndromes, and Gene Ontology terms, among others. It is characterized by wide coverage and a high degree of concept type heterogeneity, which can make it difficult to use when only a specific subset of terms is required. The UMLS Metathesaurus forms the main part of UMLS and organizes over 1 million concepts denoted by 5 million term variants. The Metathesaurus has been used for named entity recognition (e.g., Aronson et al., 2001). More specialized subsets have also been compiled from this resource and used for the identification of disease names (e.g., Jimeno et al., 2008). An assessment of the suitability of UMLS for language processing purposes was carried out by McCray et al. (2001).
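The organization described above, in which each concept is denoted by several term variants and carries a concept type, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Concept` class, the `build_term_index` and `lookup` helpers, the toy identifiers, and the sample entries are hypothetical and do not reproduce the actual UMLS data format or any UMLS API. The sketch shows how restricting the index to one concept type yields the kind of specialized subset mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Concept:
    """A toy concept record: one identifier, one type, many term variants."""
    concept_id: str          # hypothetical identifier, not a real UMLS CUI
    semantic_type: str       # illustrative concept type label
    variants: Set[str] = field(default_factory=set)


# Hand-made sample entries, for illustration only.
CONCEPTS = [
    Concept("X0001", "Disease or Syndrome",
            {"myocardial infarction", "heart attack"}),
    Concept("X0002", "Gene or Genome",
            {"BRCA1", "breast cancer 1 gene"}),
    Concept("X0003", "Disease or Syndrome",
            {"breast cancer", "breast carcinoma"}),
]


def build_term_index(concepts: Iterable[Concept],
                     semantic_types: Optional[Set[str]] = None
                     ) -> Dict[str, Set[str]]:
    """Map each lower-cased term variant to the concepts it denotes.

    If semantic_types is given, only concepts of those types are indexed,
    which produces a specialized subset of the terminology.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for concept in concepts:
        if semantic_types is not None and concept.semantic_type not in semantic_types:
            continue
        for variant in concept.variants:
            index[variant.lower()].add(concept.concept_id)
    return index


def lookup(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Return the sorted identifiers of all concepts denoted by a term."""
    return sorted(index.get(term.lower(), set()))


full_index = build_term_index(CONCEPTS)
disease_index = build_term_index(CONCEPTS, {"Disease or Syndrome"})
```

With these toy entries, `lookup(full_index, "Heart Attack")` resolves the variant to its concept regardless of case, while `lookup(disease_index, "BRCA1")` returns an empty list because gene concepts were excluded from the disease-only subset.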