1. Introduction
Biomedical document repositories are ubiquitous on the web in the form of biomedical research literature, business/domain articles, healthcare guidelines and medical reports. Retrieving these healthcare documents is challenging because most of their content is unstructured text (Malet, Munoz, Appleyard, & Hersh, 1999). Metadata, such as author information, title and journal name, has traditionally been used to assist information retrieval from large medical databases. While metadata enables basic search, meaningful retrieval of healthcare documents based on their free-flowing text content remains a significant challenge. Existing means of retrieving healthcare documents depend primarily on general web search engines such as Google and Yahoo, or on domain-specific search engines (e.g., Entrez) that crawl and index documents from online databases such as PubMed Central (PMC) (Xu, McCusker, & Krauthammer, 2008). Additionally, biomedical documents are multi-modal in nature; that is, they contain free-flowing text, semi-structured text and figures.
Embedded figures form an essential part of biomedical documents and are known to be useful for biomedical document mining. Figures also provide useful information about the semantic category of the document content (Shatkay, Chen, & Blostein, 2006). For example, biomedical research literature primarily contains biomedical images and plots, whereas business/domain articles tend to have more diagrams. Similar behavior is observed in healthcare guidelines and medical reports, which primarily contain biomedical images along with text. These document-level semantic associations can be used to categorize a given document. The use of embedded figures, instead of natural language processing techniques, for biomedical document retrieval has been studied in Chowattanakul, Rai and Radha Krishna (2011). In the present work, we categorize embedded figures for document classification and compute figure similarity for retrieval.
Figure retrieval from healthcare documents is a relatively new research area. Past work in this area focuses on specific biomedical domains. As a result, the figure categories defined in that work are problem-specific. We survey the related literature on the figure retrieval task and identify common figure categories that can be used for retrieval.
The use of embedded figures to navigate through associated documents in a database was introduced in Xu et al. (2008). Along these lines, Esteban and Iossifov (2009) advocated applying image features from embedded figures, in combination with embedded text, to categorize biomedical documents into several classes within systems biology. Both schemes assign each embedded figure to one of the following categories: sets, graph, diagram, time, gel, microscopy, pathway and structure. While the first two classes are generic, that is, not tied to a specific application domain, the remaining classes are specific to systems biology. Several of the domain-specific classes are instances or subsets of more generic classes. For example, time represents data plotted over time; gel and microscopy form a part of biomedical images, while pathway and structure are sub-categories of diagram. Similarly, Shatkay et al. (2006) categorized the embedded figures in biomedical documents into Graphical and Experimental images. Graphical images are sub-categorized into Diagrams, Bar Chart and Line Chart, while Gel Electrophoresis and Microscopy images fall within the class of Experimental figures. These categories are then used in combination with text for meaningful document classification.
In this work, we identify the common classes of embedded figures from the existing literature on biomedical figure retrieval (Xu et al., 2008; Esteban & Iossifov, 2009; Shatkay et al., 2006). Four general classes, namely Diagram (or Chart), Plot, Geometrical Shapes and Biomedical Images, encompass all kinds of figures observed in biomedical documents (see Figure 1). Based on this information, we analyze the structural properties of figures and develop a retrieval-based scheme for identifying figure categories.
Figure 1. Embedded biomedical figures depicting samples of the four categories: diagrams, plots, biomedical images and geometrical shapes (row-wise)
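The relationship between the domain-specific categories of the cited schemes and the four general classes can be sketched as a simple lookup table. The following is only an illustrative sketch, not the scheme developed in this work: the names `GENERAL_CLASSES`, `CATEGORY_TO_GENERAL` and `to_general_class` are hypothetical, and the entries marked as assumed (time, bar chart, line chart, gel electrophoresis) are our reading of the descriptions above rather than mappings stated explicitly in the cited papers. Categories whose general class is not indicated in the text (sets, graph) are deliberately left unmapped.

```python
from typing import Optional

# The four general figure classes named in the text.
GENERAL_CLASSES = ("Diagram", "Plot", "Geometrical Shapes", "Biomedical Images")

# Hypothetical lookup from literature-specific categories to general classes.
CATEGORY_TO_GENERAL = {
    # Relationships stated in the text.
    "diagram": "Diagram",
    "pathway": "Diagram",
    "structure": "Diagram",
    "gel": "Biomedical Images",
    "microscopy": "Biomedical Images",
    # Assumed mappings (not stated explicitly in the source).
    "time": "Plot",  # "data plotted over time"
    "bar chart": "Plot",
    "line chart": "Plot",
    "gel electrophoresis": "Biomedical Images",
}


def to_general_class(label: str) -> Optional[str]:
    """Return the general class for a category label, or None if unmapped."""
    return CATEGORY_TO_GENERAL.get(label.strip().lower())
```

For instance, `to_general_class("Pathway")` yields `"Diagram"`, while an unmapped label such as `"sets"` yields `None`, leaving the decision to a downstream classifier or a human annotator.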