Figure Based Biomedical Document Retrieval System using Structural Image Features

Harikrishna G. N. Rai (Infosys Labs, Infosys Limited, Bangalore, India), K Sai Deepak (Infosys Labs, Infosys Limited, Bangalore, India) and P. Radha Krishna (Infosys Labs, Infosys Limited, Hyderabad, India)
Copyright: © 2012 | Pages: 20
DOI: 10.4018/jkdb.2012010103

The multi-modal and unstructured nature of documents makes their retrieval from healthcare document repositories a challenging task. Text-based retrieval is the conventional approach to this problem. In this paper, the authors explore an alternative avenue: using embedded figures for the retrieval task. Usually, the context of a document is directly reflected in its figures; embedded text within these figures, along with image features, has therefore been used for similarity-based retrieval of figures. The present work demonstrates that image features describing the structural properties of figures are sufficient for the figure retrieval task. First, the authors analyze the problem of figure retrieval from biomedical literature and identify significant classes of figures. Second, they use edge information as a means to discriminate between the structural properties of each figure category. Finally, the authors present a methodology that uses a novel feature descriptor, the Fourier Edge Orientation Autocorrelogram (FEOAC), to describe the structural properties of figures and to build an effective biomedical document retrieval system. The experimental results show improved retrieval performance with FEOAC on the figure retrieval task, especially when most of the edge information is retained. Apart from invariance to scale, rotation and non-uniform illumination, the proposed feature descriptor is shown to be relatively robust to noisy edges.
Article Preview

1. Introduction

Biomedical document repositories are ubiquitous on the web in the form of biomedical research literature, business/domain articles, healthcare guidelines and medical reports. Retrieval of these healthcare documents is a challenging task since most of their content is in the form of unstructured text (Malet, Munoz, Appleyard, & Hersh, 1999). Metadata, in the form of author information, title and journal name, has traditionally been used to assist information retrieval from large medical databases. While metadata enables basic search, meaningful retrieval of healthcare documents based on their free-flowing text content poses a significant challenge. Existing means of retrieving healthcare documents depend primarily on general web search engines such as Google and Yahoo, or on domain-specific search engines (e.g., Entrez) that crawl and index documents from online databases such as PubMed Central (PMC) (Xu, McCusker, & Krauthammer, 2008). Additionally, biomedical documents are multi-modal in nature; that is, they contain free-flowing text, semi-structured text and figures.

Embedded figures form an essential part of biomedical documents. They are known to be useful for the task of biomedical document mining. Figures also provide useful information about the semantic category of the document content (Shatkay, Chen, & Blostein, 2006). For example, biomedical research literature primarily consists of biomedical images and plots, whereas business/domain articles tend to have more diagrams. Similar behavior is also observed in healthcare guidelines and medical reports, which primarily contain biomedical images along with text. These document-level semantic associations can be useful for categorizing a given document. The use of embedded figures, instead of natural language processing techniques, for biomedical document retrieval has been studied in Chowattanakul, Rai and Radha Krishna (2011). In the present work, we categorize embedded figures for document classification and perform similarity computation of figures for retrieval.

Categories of Figures in Biomedical Literature

Figure retrieval from healthcare documents is a relatively new research area. Past work in this area focuses on specific biomedical domains. As a result, the categories defined for the figures are problem-specific. We study the related literature on the figure retrieval task and identify common figure categories that can be used for retrieval.

The use of embedded figures to navigate through associated documents in a database was introduced in Xu et al. (2008). Along these lines, Esteban and Iossifov (2009) advocated applying image features from embedded figures, in combination with embedded text, to categorize biomedical documents into several classes within systems biology. Both schemes assign each embedded figure to one of the following categories: sets, graph, diagram, time, gel, microscopy, pathway and structure. While the first two classes are generic, that is, they are not tied to a specific application domain, the remaining classes are specific to systems biology. Several domain-specific classes are instances or subsets of the first two generic classes. For example, time represents data plotted over time; gel and microscopy form a part of biomedical images, while pathway and structure are sub-categories of diagram. Similarly, Shatkay et al. (2006) categorized the embedded figures in biomedical documents into Graphical and Experimental images. Graphical images are sub-categorized into Diagrams, Bar Chart and Line Chart, while Gel Electrophoresis and Microscopy images fall within the class of Experimental figures. These categories are further used in combination with text for their meaningful classification.
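The parent/child relations named in the example above can be written down as a small lookup table. This is only an illustrative sketch: the names `PARENT_CLASS` and `generalize` are hypothetical, and reading "time" as a kind of graph is an interpretation of "data plotted over time", not something the cited schemes state in this form.

```python
# Hypothetical lookup table; only the relations themselves come from the text.
PARENT_CLASS = {
    "time": "graph",                    # assumption: data plotted over time is a graph
    "gel": "biomedical image",
    "microscopy": "biomedical image",
    "pathway": "diagram",
    "structure": "diagram",
}


def generalize(label):
    """Return the broader class for a domain-specific label, or the label itself."""
    return PARENT_CLASS.get(label, label)
```

For instance, `generalize("pathway")` yields `"diagram"`, while a label without a listed parent, such as `"graph"`, is returned unchanged.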

In this work, we identify the common classes of embedded figures from the existing literature on biomedical figure retrieval (Xu et al., 2008; Esteban & Iossifov, 2009; Shatkay et al., 2006). There are four general classes, namely Diagram (or Chart), Plot, Geometrical Shapes and Biomedical Images, that encompass all kinds of figures observed in biomedical documents (see Figure 1). Based on this information, we analyze the structural properties of figures and develop a retrieval-based scheme for identifying figure categories.

Figure 1. Embedded biomedical figures depicting samples of the four categories (row-wise): diagrams, plots, biomedical images and geometrical shapes
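The abstract describes a descriptor built from edge orientations and a Fourier step, but this preview does not define it. The sketch below is therefore not the authors' FEOAC; it is a rough illustration of one way an edge orientation autocorrelogram followed by a Fourier magnitude could be assembled. Everything specific in it is an assumption: the function names, the 36 orientation bins, the distance set, the relative threshold, the use of `np.gradient` as the edge operator, and the restriction to horizontal and vertical pixel pairs.

```python
import numpy as np


def edge_orientation_autocorrelogram(image, n_bins=36, distances=(1, 3, 5, 7),
                                     threshold=0.1):
    """Count pairs of edge pixels that share an orientation bin at each distance.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    image = np.asarray(image, dtype=float)
    gy, gx = np.gradient(image)                      # finite-difference gradients
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # orientation in [0, 2*pi]
    bins = np.minimum((angle / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    edges = magnitude > threshold * magnitude.max()  # keep the stronger gradients

    acorr = np.zeros((n_bins, len(distances)))
    for col, d in enumerate(distances):
        # Pair each pixel with the pixel d steps to the right and d steps down.
        pairs = [
            (edges[:, :-d], edges[:, d:], bins[:, :-d], bins[:, d:]),
            (edges[:-d, :], edges[d:, :], bins[:-d, :], bins[d:, :]),
        ]
        for e1, e2, b1, b2 in pairs:
            match = e1 & e2 & (b1 == b2)             # both edges, same orientation bin
            acorr[:, col] += np.bincount(b1[match], minlength=n_bins)

    total = acorr.sum()
    if total > 0:
        acorr /= total                               # normalize away the pair count
    return acorr


def fourier_descriptor(acorr):
    """Magnitude of the DFT taken along the orientation axis, flattened."""
    return np.abs(np.fft.fft(acorr, axis=0)).ravel()
```

The design choice worth noting is the last step: the magnitude of a discrete Fourier transform does not change when its input is circularly shifted, so a shift of all orientation bins by the same amount, which is what an idealized in-plane rotation would cause, leaves `fourier_descriptor` unchanged. Whether the published descriptor obtains its rotation invariance this way cannot be confirmed from the preview.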
