Ensemble Classification System for Scientific Chart Recognition from PDF Files

Ensemble Classification System for Scientific Chart Recognition from PDF Files

S. Nagarajan (Department of Computer Science, K. S. R. College of Arts and Science, Tiruchengode, Tamilnadu, India) and V. Karthikeyani (Department of Computer Science, Thiruvalluvar Government Arts College, Rasipuram, Tamilnadu, India)
Copyright: © 2012 |Pages: 10
DOI: 10.4018/ijcvip.2012100101
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

Portable Document Format (PDF) is the most frequently used universal document format on the Internet and E-Publishing. Wide usage of PDF files has increased the need of conversion tools that convert PDF file content to text or HTML formats. A PDF converter can be categorized into two domains, namely, text recognition and graphics recognition. This paper focus on graphic recognition, especially chart type identification, which is concerned with developing algorithms that has the ability to determine the type of a given chart image from a PDF file. In the proposed system, initially an enhanced connected component and statistical feature based method is used to separate the chart region from other regions. The chart region is then analyzed and grouped as either 2-dimensional or 3-dimensional chart. After separating the graphic component from the text components, feature extraction is performed. The features can be grouped as object features, texture features and shape features. The combined feature vector is then classified using ensemble classification system. Experimental results show that the chart separation, feature extraction and ensemble classification models significantly improve the quality of chart identification.
Article Preview

1. Introduction

Portable Document Format (PDF), introduced by Adobe systems in 1993, is considered as the most frequently used universal document format on the Internet and E-Publishing. The reason behind such popularity is that it presents a hardware and software independent platform for people to share their ideas and work in digital form. The task of creating PDF files is extremely simple and fast and can solve the undesirable formatting problems at the receiver’s side. A PDF file has an added advantage of requiring small storage space. In the information explosion era, where both software, hardware and communication medium is envisaging tremendous growth, usage of PDF documents in businesses helps to build good reputation and helps in ‘paperless office’ environment. A PDF file is protected in the sense that it is not possible to hamper or change content directly.

Wide usage of PDF files has increased the use of conversion tools in two fashions. The first type is used to convert a source document into PDF formation and the second type is used to convert PDF file content to text or HTML formats. A PDF file can be created from any source document or application like word, excel, PowerPoint and even an image. A PDF file can accommodate various types of data including text, hyperlinks, mathematical formulae, pictures, tables and charts. PDF to text/HTML converters are much used in situations where users have difficulty in reading the multiple column or small font documents and incapability of devices (like embedded systems) to handle PDF formats. However, designing such a converter is a challenging problem because of the various types of data in PDF files. A PDF-to-HTML Converter (PHC), apart from being fast and easy to use, should also have highest recognition accuracy and should be able to retain the format of the original document.

For this purpose, the existing PHC can be categorized into two domains, namely, text recognition and graphics recognition. Text recognition focus on the textual part of the PDF files and uses Optical Character Recognition (OCR) algorithms during conversion (Islam et al., 2009; Martinez-Alvarez et al., 2010). Graphics recognition, on the other hand, is focused on the lines and symbols of the PDF file. Graphics in PDF files include diagrams, maps, engineering drawing and scientific charts. Much of the work reported in literature focus on the first category, that is, text recognition. Knowledge mining from graphics is still very sparse, as though rich in information content are more complex and unwieldy to process than text.

Out of the various graphical objects, this paper recognizes one particular type of graphic objects, namely, scientific charts. Scientific charts are frequently embedded objects in PDF files and are used to convey a clear analysis of scientific or research results and commercial data trends. Scientific charts in PDF files can be viewed as an object composed of graphics and text elements arranged in a regular fashion. The graphics elements have simple syntactic and semantic rule constraints and can provide a concise representation for data analysis. Because of these, scientific charts are extensively used in many applications. While considering scientific applications, like research articles and e-journals, scientific charts and figures occupy more than 50% of the PDF file content (Shao & Futrelle, 2006). Scientific charts can be created using a variety of tools and programs and hence there exists lot of patterns to represent them. Examples include bar charts, line charts and pie charts. Thus, with the many different types available, a general chart recognition algorithm for PDF documents has become imperative. Chart recognition is concerned with the problem of identifying the chart type, conversion of information in chart into computer readable form. This paper is focused on the first problem, that is, to develop algorithm that has the ability to determine the type of a given chart image form a PDF file. For this purpose, the usage of ensemble classification is proposed. Classification, a frequently used data mining approach, is a process that separates a set of charts according to their visual content into one of the number of predefined categories. Each category is represented by a set of features and the classification algorithm maps these features to a class using machine learning algorithms. Ensemble classification, a method used to improve the classification accuracy, uses multiple classifiers for this purpose.

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 7: 4 Issues (2017): 3 Released, 1 Forthcoming
Volume 6: 2 Issues (2016)
Volume 5: 2 Issues (2015)
Volume 4: 2 Issues (2014)
Volume 3: 4 Issues (2013)
Volume 2: 4 Issues (2012)
Volume 1: 4 Issues (2011)
View Complete Journal Contents Listing