Hindi Text Document Classification System Using SVM and Fuzzy: A Survey

Shalini Puri (Birla Institute of Technology, Ranchi, India) and Satya Prakash Singh (Birla Institute of Technology, Ranchi, India)
Copyright: © 2018 | Pages: 31
DOI: 10.4018/IJRSDA.2018100101

Abstract

In recent years, many information retrieval, character recognition, and feature extraction methods have been proposed for Devanagari, and for Hindi in particular, across different application domains. Given the large volume of scanned data now available, and to extend existing automated Hindi systems beyond optical character recognition, this article introduces the idea of a classification system for printed and handwritten Hindi documents that uses a support vector machine and fuzzy logic. Such a system first pre-processes document images containing text and then classifies them into predefined categories. Building on this concept, the article presents a feasibility study of such systems and their relevance to Hindi, a survey of statistical measurements of Hindi keywords obtained from different sources, and the inherent challenges of printed and handwritten documents. Technical reviews of existing techniques are provided and presented graphically to compare their parameters, including the content types, document forms, and classifiers they use.

Introduction

Hindi Document Mining (HDM) refers to the search and retrieval of relevant information from Hindi documents, while ignoring irrelevant information, so that the retrieved information can be used to classify unknown text documents into different groups. Hindi Text Document Classification (HTDC), an extension of HDM, classifies documents into pre-specified, labeled, mutually exclusive categories: the system is first trained on the documents and is then tested and used for classification (Puri, 2011; Puri & Kaushik, 2012). The complexity of such automatic, intelligent Hindi text classification systems increases when they must classify scanned printed and handwritten text documents into predefined classes. Training, testing, and classification on these documents are crucial and challenging, and therefore require major attention. Document types range from simple Hindi text to scanned versions of printed and handwritten text documents (Bag & Harit, 2013; Bagadkar & Malik, 2014; Chaudhari & Pal, 1997; Chau & Yeh, 2002; Hole & Ragha, 2011; Jayadevan, Kolhe, Patil, & Pal, 2011b; Kumar, Holambe, Thool, & Jagade, 2012; Nevetha & Baskar, 2015; Puri & Singh, 2016; Pal, Wakabayashi, & Kimura, 2009). A look at the application domains of existing Devanagari and Hindi based systems shows that their scope is limited to script recognition and discrimination (Hassan, Garg, Chaudhury, & Gopal, 2011; Kumar et al., 2012); identification, recognition, and Shirorekha removal at the character and word (keyword) levels (Shinde & Dandawate, 2014); text summarization; recognition and separation of objects of multiple colors, fonts, orientations, and sizes (Singh, Mittal, & Ghosh, 2014); and separation of images from non-images.
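
The train-then-classify flow described for HTDC can be illustrated with a minimal sketch. It is not the system surveyed in this article: it assumes the text has already been extracted from the document (no image pre-processing or character recognition), it omits the fuzzy logic component, and it uses a bias-free linear SVM trained by subgradient descent on word counts. The four training documents, the two categories (+1 for sports, -1 for politics), and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only (an assumption); real Hindi text would
    # also need normalization and stop-word handling.
    return text.split()


def build_vocab(docs):
    # Map each distinct token to a feature index, in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(doc, vocab):
    # Bag-of-words count vector; tokens outside the vocabulary are ignored.
    vec = [0.0] * len(vocab)
    for token, count in Counter(tokenize(doc)).items():
        if token in vocab:
            vec[vocab[token]] = float(count)
    return vec


def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    # Subgradient descent on the L2-regularized hinge loss.
    # The bias term is left out to keep the sketch short.
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Regularization step: shrink all weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step: only samples inside the margin move the weights.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(doc, vocab, w):
    # The sign of the linear score decides the category.
    score = sum(wj * xj for wj, xj in zip(w, vectorize(doc, vocab)))
    return 1 if score >= 0 else -1


# Hypothetical toy training set: +1 = sports, -1 = politics.
train_docs = [
    "क्रिकेट मैच टीम जीत",
    "खिलाड़ी टीम मैच खेल",
    "चुनाव सरकार मंत्री संसद",
    "सरकार चुनाव नेता मतदान",
]
train_labels = [1, 1, -1, -1]

vocab = build_vocab(train_docs)
X = [vectorize(doc, vocab) for doc in train_docs]
w = train_linear_svm(X, train_labels)

# Testing phase: classify documents that were not seen during training.
print(predict("टीम मैच जीत", vocab, w))
print(predict("चुनाव सरकार नेता", vocab, w))
```

In practice a library SVM and a much larger labeled corpus would replace the hand-written trainer and toy data; the sketch only shows how training precedes testing and classification.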
