A Hybrid Hindi Printed Document Classification System Using SVM and Fuzzy: An Advancement

Shalini Puri (Birla Institute of Technology, Mesra, Ranchi, India) and Satya Prakash Singh (Birla Institute of Technology, Mesra, Ranchi, India)
Copyright: © 2019 |Pages: 25
DOI: 10.4018/JITR.2019100106

Abstract

This article introduces an advanced Hindi printed document classification system built on tri-layered segmentation and a bi-level classifier. The system categorizes imaged documents into pre-defined, mutually exclusive categories by using SVM for character classification and Fuzzy matching for document classification. During training, the enhanced, noise-free image is segmented into lines and words by profiling. The system then obtains Shirorekha Less (SL) isolated characters, along with upper, left and right modifier components, from the SL words. These components use their locations and the inter character-modifier component distance to associate only with their corresponding characters. Next, confidence values of all characters are calculated with SVM training, and all characters are mapped into Romanized labels to generate the words. Finally, documents are classified by Fuzzy-based matching of the detected Romanized words against the predefined classes. The average execution times for SL characters are 0.22675 sec. and 0.20375 sec., and the classification accuracies are 74.61% and 80.73%, for training and testing, respectively.
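
As an illustration of the profiling step mentioned above, the following minimal sketch splits a binarized page into text lines using a horizontal projection profile. It is not the authors' implementation: the function name `segment_lines`, the list-of-lists image format (1 for ink, 0 for background) and the zero-ink gap rule are all assumptions made for this example.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink.

    `image` is a hypothetical binarized page: 1 = ink pixel, 0 = background.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # first inked row of a new text line
        elif ink == 0 and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:                  # page ends while still inside a line
        lines.append((start, len(profile)))
    return lines


# Tiny synthetic page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(page))
```

Applying the same idea to the columns of each detected line (a vertical profile) would give word boundaries. A real scan would likely need a noise threshold rather than a strict zero test.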

Introduction

Over the past two decades, with the advent, evolution, digitization and continuous growth of printed document analysis, Devanagari script and Hindi based processing systems have established a consistent place in information searching, extraction and retrieval from text and imaged documents. Although text processing and accurate information retrieval (Puri & Kaushik, 2011; Puri & Kaushik, 2012) from Hindi printed scanned documents (Puri & Singh, 2018) have always been complicated and challenging, the field has achieved a great deal of success in accurate word and character recognition and has recently attracted a high level of attention from researchers. In this article, a new automated Hindi Printed Document Classification System using Support Vector Machine and Fuzzy logic (HPDC-SF) is introduced, which proves to be an efficient advancement over currently available offline Hindi document processing systems. HPDC-SF is designed to classify scanned printed imaged documents into pre-defined, mutually exclusive categories by using a Support Vector Machine (SVM) for character-level classification and Fuzzy matching for document-level classification.
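
The document-level step can be pictured with a small sketch. The preview does not specify the fuzzy matching function, so the code below substitutes the similarity ratio from Python's standard `difflib.SequenceMatcher` as a stand-in membership value. The class names, the Romanized keyword lists and the averaging rule are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

# Hypothetical predefined classes with Romanized keywords (illustrative only).
CLASS_KEYWORDS: Dict[str, List[str]] = {
    "sports": ["khel", "cricket", "khiladi"],
    "politics": ["sarkar", "chunav", "mantri"],
}


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], used here as a stand-in fuzzy membership value."""
    return SequenceMatcher(None, a, b).ratio()


def classify(words: List[str],
             class_keywords: Dict[str, List[str]] = CLASS_KEYWORDS
             ) -> Tuple[str, Dict[str, float]]:
    """Assign the document to the class whose keywords best match its words."""
    scores = {}
    for label, keywords in class_keywords.items():
        # Each word's membership in a class is its best match among the keywords.
        memberships = [max(similarity(w, k) for k in keywords) for w in words]
        # Document score is the mean membership (0.0 for an empty document).
        scores[label] = sum(memberships) / len(memberships) if memberships else 0.0
    best = max(scores, key=scores.get)
    return best, scores


# "chunaw" mimics a character-level recognition error for "chunav".
label, scores = classify(["sarkar", "chunaw", "mantri"])
print(label, scores)
```

Because matching is approximate, a word with one misrecognized character still contributes a high membership to the intended class, which is the motivation for fuzzy rather than exact matching at this level.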

Many Hindi based processing systems have emerged in recent years through the combination of artificial intelligence (Padhy, 2005), pattern recognition, image processing (Gonzalez & Woods, 2008) and text mining (Han, Kamber, & Pei, 2012) concepts. These systems have contributed substantially to discrete and dynamic real-time application areas in distributed environments. Applications of automatic Hindi text processing systems cover text syntax and semantics, editors, spell checkers, formatters, linguistics-based grammar and vocabulary, converters, translators, transliteration, summarization, speech recognition with conversion, cross-lingual processing and many other related fields. On the other hand, many methodologies for Hindi text imaged documents have emerged in recent years (Puri & Singh, 2018; Sinha, 2009), covering the extraction and recognition of optical characters, words and lines in multi-script, multi-colored, multi-form, multi-pattern, multi-oriented, multi-font and multi-sized documents. There is therefore a strong need to design an advanced imaged document processing and classification system that can work beyond Optical Character Recognition (OCR). Such systems need to build words from the extracted optical characters, gather the image contents, and classify the Hindi printed images optimally. Accuracy estimation is a major and highly critical aspect of such systems, because only correct OCR, word building, and effective classifier implementation can lead to accurate classification of Hindi printed images (Puri & Singh, 2018). The application areas of these automated document processing systems include categorization of Government legal files and security files, identification of property owners, etc. In addition, they play a major role in separating important text images from non-important ones.
To evaluate the performance and efficiency of HPDC-SF, various experiments have been performed on different types of Hindi printed images, which were collected from Government sites, newsletters, novels, magazines, blogs, newspaper cuttings, etc.
