Combining Supervised Learning Techniques to Key-Phrase Extraction for Biomedical Full-Text

Yanliang Qi, Min Song, Suk-Chung Yoon, Lori deVersterre
DOI: 10.4018/978-1-4666-2047-6.ch003

Abstract

Key-phrase extraction plays a useful role in research areas of Information Systems (IS) such as digital libraries. Short metadata such as key phrases help searchers understand the concepts found in documents. This paper evaluates the effectiveness of different supervised learning techniques on biomedical full-text: Sequential Minimal Optimization (SMO) and K-Nearest Neighbor, both of which could be embedded inside an information system for document search. The authors use these techniques to extract key phrases from PubMed and evaluate the performance of these systems using the holdout validation method. This paper compares different classifier techniques and the performance differences between the full text and its abstract. Compared with the authors’ previous work, which investigated the performance of Naïve Bayes, Linear Regression and SVM(reg1/2), this paper finds that SVMreg-1 performs best in key-phrase extraction for full-text, whereas Naïve Bayes performs best for abstracts. These techniques should be considered for use in information system search functionality. Additional research issues are also identified.

Introduction

In recent years, there has been a tremendous increase in the number of biomedical documents in digital libraries that provide users (researchers, readers) with access to the scientific and technical literature in those documents (articles or abstracts) (Liu, 2007). For example, the PubMed digital library (a free search engine for accessing the MEDLINE database of biomedical research articles) currently contains over 18 million citations from various types of biomedical documents published in the past several decades (www.pubmed.gov). With the rapid expansion of the number of biomedical documents, effectively identifying the relevant documents in a large dataset has become increasingly difficult for users. Because it is a challenging task for a reader to examine complete documents to determine whether they would be useful, short semantic metadata such as key phrases offer an alternative way for a reader to understand the concepts of a document (Hamdi, 2008). Key phrases are increasingly used as brief descriptors of text document content. However, not all of the biomedical documents in digital libraries have key phrases, so readers have to read through the documents to determine whether they are relevant to their research. Therefore, automatically presenting key phrases from a document has become an important task in the biomedical domain.

Automatic key-phrase extraction can be defined as the process of extracting key phrases from a document that an author (or a professional indexer) is likely to assign to that document (El-Beltagy, 2006). Consequently, automatic extraction makes it feasible to generate key phrases for a large number of full-text documents that do not have manually assigned key phrases. It also reduces the cost and time spent manually assigning key phrases to documents (Zhang, Zincir-Heywood, et al., 2005). Key phrases, as short semantic metadata, are useful for various purposes, including summarization and search engine optimization. The use of key phrases for full-text documents varies: when they are presented on the first page of the document, the goal is summarization, which enables users to quickly determine the concept of the document; when they are entered in a search engine query box in a digital library, the goal is to enable users to make the search more precise (Turney, 2000). Therefore, they play an important role in document description and document search in digital libraries, e.g., PubMed.

Traditionally, key phrases are assigned manually to documents by authors or professional indexers. The indexers often choose key phrases from a predefined controlled vocabulary: Medical Subject Headings (MeSH). Authors usually choose key phrases to present their work in a certain way or to maximize its chance of being noticed by particular searchers. However, manual assignment of key phrases has three issues: (1) it is a time-consuming process, (2) it requires knowledge of the subject matter, and (3) it depends on an up-to-date controlled vocabulary list (Witten, Paynter et al., 1999; Kumar & Srinathan, 2008). Automatic key-phrase extraction can be a good practical alternative.

Key phrases can be automatically generated in two ways: (1) key-phrase assignment (based on controlled-vocabulary indexing), which assigns key phrases from a controlled vocabulary to documents, or (2) key-phrase extraction (based on free-term indexing), which identifies and selects the most descriptive phrases in the document itself (Dumais, Platt et al., 1998).

In domain-specific controlled indexing, key phrases are chosen from a controlled vocabulary such as the MeSH terminology list (Medelyan & Witten, 2006). MeSH provides a consistent way to assign phrases to biomedical documents that have the same concept. However, the downsides are that the lists are expensive to build and maintain, so they are not always up to date, and potentially useful phrases are ignored if they are not in the list (Jones & Paynter, 2003).
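
The contrast between key-phrase assignment and key-phrase extraction described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-entry `CONTROLLED_VOCAB` is a hypothetical stand-in for a MeSH-like list, the sample text is invented, and the extraction step ranks candidates by raw frequency rather than with the supervised classifiers this chapter evaluates.

```python
import re
from collections import Counter

# Hypothetical mini controlled vocabulary (a stand-in for a MeSH-like list).
CONTROLLED_VOCAB = {"gene expression", "breast neoplasms", "protein binding"}

# Small illustrative stopword list, not a standard one.
STOPWORDS = {"the", "of", "in", "and", "a", "is", "to", "for", "with", "was"}

# Invented sample text used only to exercise the two functions.
SAMPLE_TEXT = (
    "Gene expression profiling of tumor samples. "
    "Tumor samples with altered gene expression showed microRNA regulation. "
    "MicroRNA regulation was strongest in tumor samples."
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_phrases(tokens, max_len=3):
    """Yield n-grams (1..max_len words) that neither start nor end with a stopword."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)


def assign_key_phrases(text, vocab=CONTROLLED_VOCAB):
    """Key-phrase assignment: keep only candidates found in the controlled vocabulary."""
    found = set(candidate_phrases(tokenize(text)))
    return sorted(found & vocab)


def extract_key_phrases(text, top_k=3):
    """Key-phrase extraction: rank multi-word candidates from the text itself.

    Raw frequency is a naive placeholder for a learned scoring model.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(p for p in candidate_phrases(tokenize(text)) if " " in p)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in ranked[:top_k]]


if __name__ == "__main__":
    print("assigned: ", assign_key_phrases(SAMPLE_TEXT))
    print("extracted:", extract_key_phrases(SAMPLE_TEXT))
```

In this toy setting, assignment can return only phrases that already exist in the vocabulary, so a frequent phrase such as "microrna regulation" is missed, while extraction can surface it because it draws candidates from the document itself. A supervised extractor would replace the frequency ranking with a classifier trained on documents whose key phrases are known.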
