Introduction
Hindi Document Mining (HDM) refers to the search and retrieval of relevant information from Hindi documents, ignoring irrelevant content, so that the retrieved information can be used to classify unknown text documents into different groups. Hindi Text Document Classification (HTDC), an extension of HDM, classifies documents into pre-specified, labeled, mutually exclusive categories: the system is first trained, and then tested and used for classification on these documents (Puri, 2011; Puri & Kaushik, 2012). The complexity of such automatic and intelligent Hindi text classification systems increases when they must classify scanned printed and handwritten text documents into pre-defined classes. Training, testing and classification for these documents are crucial and challenging, and therefore require major attention. The document types range from simple Hindi text to scanned versions of printed and handwritten text documents (Bag & Harit, 2013; Bagadkar & Malik, 2014; Chaudhari & Pal, 1997; Chau & Yeh, 2002; Hole & Ragha, 2011; Jayadevan, Kolhe, Patil, & Pal, 2011b; Kumar, Holambe, Thool, & Jagade, 2012; Nevetha & Baskar, 2015; Puri & Singh, 2016; Pal, Wakabayashi, & Kimura, 2009). An examination of the application domains and of the existing Devanagari and Hindi based systems shows that their scope is limited to script recognition and discrimination (Hassan, Garg, Chaudhury, & Gopal, 2011; Kumar et al., 2012); identification, recognition and Shirorekha removal at the character and word (keyword) levels (Shinde & Dandawate, 2014); text summarization; recognition and separation of multi-color, multi-font, multi-orientation and multi-size objects (Singh, Mittal, & Ghosh, 2014); and separation of images from non-images.