Ensemble Classification System for Scientific Chart Recognition from PDF Files

S. Nagarajan, V. Karthikeyani

Source Title: International Journal of Computer Vision and Image Processing (IJCVIP) 2(4)

DOI: 10.4018/ijcvip.2012100101

Abstract

Portable Document Format (PDF) is the most frequently used universal document format on the Internet and in E-Publishing. Wide usage of PDF files has increased the need for conversion tools that convert PDF file content to text or HTML formats. PDF converters can be categorized into two domains, namely, text recognition and graphics recognition. This paper focuses on graphics recognition, especially chart type identification, which is concerned with developing algorithms that can determine the type of a given chart image from a PDF file. In the proposed system, an enhanced connected component and statistical feature based method is first used to separate the chart region from other regions. The chart region is then analyzed and grouped as either a 2-dimensional or a 3-dimensional chart. After the graphic components are separated from the text components, feature extraction is performed. The features can be grouped as object features, texture features and shape features. The combined feature vector is then classified using an ensemble classification system. Experimental results show that the chart separation, feature extraction and ensemble classification models significantly improve the quality of chart identification.

1. Introduction

Portable Document Format (PDF), introduced by Adobe Systems in 1993, is considered the most frequently used universal document format on the Internet and in E-Publishing. The reason behind this popularity is that it offers a hardware- and software-independent platform for people to share their ideas and work in digital form. Creating PDF files is simple and fast, and avoids undesirable formatting problems at the receiver's side. A PDF file has the added advantage of requiring little storage space. In the information explosion era, where software, hardware and communication media are all experiencing tremendous growth, the use of PDF documents in businesses helps to build a good reputation and supports a 'paperless office' environment. A PDF file is also protected in the sense that its content cannot be tampered with or changed directly.

Wide usage of PDF files has increased the use of conversion tools of two kinds. The first kind converts a source document into PDF format, and the second converts PDF file content to text or HTML formats. A PDF file can be created from any source document or application, such as Word, Excel, PowerPoint or even an image. A PDF file can accommodate various types of data, including text, hyperlinks, mathematical formulae, pictures, tables and charts. PDF to text/HTML converters are widely used where users have difficulty reading multiple-column or small-font documents, or where devices (like embedded systems) are unable to handle PDF formats. However, designing such a converter is a challenging problem because of the various types of data in PDF files. A PDF-to-HTML Converter (PHC), apart from being fast and easy to use, should also have high recognition accuracy and should be able to retain the format of the original document.

For this purpose, existing PHCs can be categorized into two domains, namely, text recognition and graphics recognition. Text recognition focuses on the textual part of PDF files and uses Optical Character Recognition (OCR) algorithms during conversion (Islam et al., 2009; Martinez-Alvarez et al., 2010). Graphics recognition, on the other hand, focuses on the lines and symbols of the PDF file. Graphics in PDF files include diagrams, maps, engineering drawings and scientific charts. Much of the work reported in the literature focuses on the first category, that is, text recognition. Knowledge mining from graphics is still very sparse, because graphics, though rich in information content, are more complex and unwieldy to process than text.

Out of the various graphical objects, this paper addresses one particular type, namely, scientific charts. Scientific charts are frequently embedded in PDF files and are used to convey a clear analysis of scientific or research results and commercial data trends. A scientific chart in a PDF file can be viewed as an object composed of graphics and text elements arranged in a regular fashion. The graphics elements have simple syntactic and semantic rule constraints and can provide a concise representation for data analysis. Because of this, scientific charts are extensively used in many applications. In scientific applications, like research articles and e-journals, scientific charts and figures occupy more than 50% of the PDF file content (Shao & Futrelle, 2006). Scientific charts can be created using a variety of tools and programs, and hence there exist many patterns for representing them. Examples include bar charts, line charts and pie charts. Thus, with the many different types available, a general chart recognition algorithm for PDF documents has become imperative. Chart recognition is concerned with two problems: identifying the chart type, and converting the information in the chart into computer-readable form. This paper focuses on the first problem, that is, developing an algorithm that can determine the type of a given chart image from a PDF file. For this purpose, the use of ensemble classification is proposed. Classification, a frequently used data mining approach, is a process that assigns each chart in a set, according to its visual content, to one of a number of predefined categories. Each category is represented by a set of features, and the classification algorithm maps these features to a class using machine learning algorithms. Ensemble classification, a method used to improve classification accuracy, uses multiple classifiers for this purpose.
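
The idea of combining multiple classifiers can be illustrated with a minimal majority-vote sketch. This is not the authors' ensemble, whose members and combination rule are not described in this preview. The three toy classifiers, the feature layout (roundness, rectangle count, stroke count) and the thresholds are all hypothetical, chosen only to show how several feature-to-class mappings can be merged into one decision.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical feature layout (an assumption, not from the paper):
# (roundness in [0, 1], number of rectangles, number of long strokes)
FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], str]


def by_roundness(f: FeatureVector) -> str:
    # Toy rule: a very round dominant shape suggests a pie chart.
    return "pie" if f[0] > 0.8 else "bar"


def by_rectangles(f: FeatureVector) -> str:
    # Toy rule: several rectangles suggest a bar chart.
    return "bar" if f[1] >= 3 else "line"


def by_strokes(f: FeatureVector) -> str:
    # Toy rule: strokes without rectangles suggest a line chart.
    if f[2] >= 1 and f[1] == 0:
        return "line"
    return "pie" if f[0] > 0.8 else "bar"


def majority_vote(classifiers: Sequence[Classifier], f: FeatureVector) -> str:
    # Each member votes for one class; the most frequent label wins.
    # On a tie, the label voted for first is returned.
    votes = Counter(clf(f) for clf in classifiers)
    return votes.most_common(1)[0][0]


ENSEMBLE = [by_roundness, by_rectangles, by_strokes]

if __name__ == "__main__":
    print(majority_vote(ENSEMBLE, (0.9, 0, 0)))
    print(majority_vote(ENSEMBLE, (0.1, 5, 0)))
    print(majority_vote(ENSEMBLE, (0.2, 0, 2)))
```

In this sketch a single weak rule can be wrong on a given chart, yet the ensemble still returns the expected label as long as most members agree, which is the intuition behind using multiple classifiers to improve accuracy.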
