Bangla and Oriya Script Lines Identification from Handwritten Document Images in Tri-script Scenario

Sk Md Obaidullah (Aliah University, Kolkata, India), Chayan Halder (West Bengal State University, Kolkata, India), Nibaran Das (Jadavpur University, Kolkata, India) and Kaushik Roy (West Bengal State University, Kolkata, India)
DOI: 10.4018/IJSSMET.2016010103


In this paper, two popular eastern Indian scripts, namely Bangla and Oriya, are considered for line-level script identification in two tri-script groups, with Devnagari and Roman kept common to both groups. A 27-dimensional feature vector is constructed using FD (Fractal Dimension) and IMT (Interpolated Morphological Transform). 600 line-level handwritten document images from each tri-script group were used for experimentation. Promising results were obtained using multiple classifiers: MLP (Multi-Layer Perceptron) Neural Network and LMT (Logistic Model Tree) perform best for the BDR (Bangla-Devnagari-Roman) combination with 97% accuracy, and LMT outperforms the others for the ODR (Oriya-Devnagari-Roman) combination with 97.7% accuracy. A bi-script performance analysis was also carried out. In the first group, the BR (Bangla-Roman) and BD (Bangla-Devnagari) combinations reach accuracies of 98% and 97.5% respectively. In the second group, the OD (Oriya-Devnagari) and OR (Oriya-Roman) combinations reach accuracies of 98.25% and 98% respectively.

1. Introduction

India is a multi-lingual, multi-script country with 23 official languages (including English), and 13 different scripts (including Roman) are used to write those languages (Obaidullah et al. 2013; Ghosh et al. 2010). Multi-script documents, in which a single document is written using more than one script, are very common in our country. In daily life we come across many such multi-script documents, such as postal documents (Roy et al. 2004; Radha et al. 2014), government application forms, railway reservation forms, etc. A second situation arises when a collection of documents, each written in a different script, needs to be handled. Postal document sorting is one such example. Addressing these document processing problems requires the development of sophisticated techniques. Optical character recognition (OCR) is a technique for converting an image of text into its textual version. Normally an OCR system is applicable only to the script for which it was designed. For example, a Bangla OCR can process document images containing Bangla/Assamese/Manipuri characters, and a Devnagari OCR can process Hindi/Marathi/Sanskrit document images. But when a multi-script document contains more than one script, say both Bangla and Devnagari, a single OCR fails to process it. Moreover, even when a document is written in a single script, manual intervention is required to choose the appropriate OCR for that document image. This manual intervention currently costs considerable manpower and resources. One feasible solution is to first develop an automatic script identification system for all official Indic scripts and then choose the appropriate script-specific OCR to process each document image.
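The two-stage idea in the paragraph above (identify the script first, then hand the image to a script-specific OCR) can be sketched as follows. This is only a minimal illustration, not the system described in the paper: the function name `route_lines`, the classifier callable and the OCR engine table are all hypothetical placeholders, and any real classifier or OCR engine would have to be supplied by the caller.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type aliases (not from the paper): a line image is any object
# that the caller's classifier and OCR engines know how to handle.
LineImage = object
ScriptClassifier = Callable[[LineImage], str]
OcrEngine = Callable[[LineImage], str]


def route_lines(
    lines: List[LineImage],
    identify_script: ScriptClassifier,
    ocr_engines: Dict[str, OcrEngine],
) -> List[Tuple[str, Optional[str]]]:
    """Identify the script of each text line, then apply the matching OCR.

    Returns one (script_label, recognised_text) pair per line. The text is
    None when no OCR engine is registered for the predicted script.
    """
    results: List[Tuple[str, Optional[str]]] = []
    for line in lines:
        # Stage 1: script identification (assumed to return a label
        # such as "Bangla", "Devnagari" or "Roman").
        script = identify_script(line)
        # Stage 2: dispatch to the script-specific OCR, if one is available.
        engine = ocr_engines.get(script)
        text = engine(line) if engine is not None else None
        results.append((script, text))
    return results
```

Keeping the classifier and the OCR engines as plain callables means the routing step stays independent of how script identification is actually performed, so a line-level identifier can be swapped in without changing the dispatch logic.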
The present work develops a line-level script identification technique, motivated by its usefulness for the many real-life document images in which the script changes from line to line. Four scripts, namely Bangla, Oriya, Devnagari and Roman, are considered in the present work. These four scripts are discussed briefly in the following section.
