Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification

S. K. Obaidullah (Department of Computer Science and Engineering, Kolkata, India), K. C. Santosh (Department of Computer Science, The University of South Dakota, Vermillion, SD, USA), Chayan Halder (West Bengal State University, Kolkata, India), Nibaran Das (Jadavpur University, Kolkata, India) and Kaushik Roy (West Bengal State University, Kolkata, India)
Copyright: © 2017 | Pages: 14
DOI: 10.4018/IJCVIP.2017040106


Document analysis research suffers from a lack of publicly available datasets. Without a publicly available dataset, one cannot make a fair comparison with state-of-the-art methods. To bridge this gap, in this paper, the authors propose a word-level document image dataset of 13 different Indic languages from 11 official scripts. It is composed of 39K words that are equally distributed, i.e., 3K words per language. For baseline results, five different classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA), simple logistic (SL), library for large linear classification (LibLINEAR) and Bayesian network (BayesNet) are used with three state-of-the-art features: spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), including their possible combinations. The authors observed that MLP provides the best results when all features are used, achieving bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
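The paper itself gives no code. Purely as an illustration of how one of the baselines could be assembled, the sketch below pairs a Radon-transform energy feature with an MLP classifier using scikit-image and scikit-learn (libraries the paper does not name); the synthetic "word images" and all parameter choices are assumptions, not the authors' pipeline.

```python
# Illustrative sketch only: Radon-transform energy features fed to an MLP.
# Synthetic stroke images stand in for real script word samples.
import numpy as np
from skimage.transform import radon
from sklearn.neural_network import MLPClassifier

def radon_energy(word_img, n_angles=8):
    """Project the image along n_angles directions and keep the
    per-angle projection energy as a compact feature vector."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(word_img, theta=theta, circle=False)
    return (sinogram ** 2).sum(axis=0)  # one energy value per angle

# Two synthetic "scripts": headline-like horizontal strokes
# (roughly Devanagari-like) vs. vertical strokes.
rng = np.random.default_rng(0)
def fake_word(label, size=32):
    img = 0.05 * rng.random((size, size))
    if label == 0:
        img[size // 2, :] = 1.0  # horizontal stroke
    else:
        img[:, size // 2] = 1.0  # vertical stroke
    return img

y = np.array([i % 2 for i in range(40)])
X = np.array([radon_energy(fake_word(lbl)) for lbl in y])
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # standardize features

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)
```

On real word images, the same shape of pipeline would simply swap the synthetic generator for binarized word crops and cross-validate instead of scoring on the training set.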
Article Preview

1. Introduction

Multi-script document processing is important for a country like India, where 23 different official languages (including English) are present and 11 different scripts are used to write them. We handle various multi-script documents in our day-to-day life, and automatic processing of those documents is a pressing need in this digital era. In general, OCRs are script specific, and processing documents having more than one script is not easy. Therefore, one common solution is to develop a script identification system (SIS) that serves as a precursor to the script-specific OCR. To address this issue, in this paper, we present a database composed of 13 different languages from 11 different scripts (containing a fairly large number of words) for automatic script identification in multi-script documents. Figure 1 shows a general block diagram of a multi-script document processing system.
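The SIS-as-precursor idea can be read as a dispatch step in front of script-specific OCR engines. The sketch below is only a structural illustration of that pipeline; every name in it is hypothetical and not an API from the paper.

```python
# Hypothetical dispatch: a script identifier routes each word image to
# the matching script-specific OCR engine (all names are illustrative).
from typing import Callable, Dict

def make_router(identify_script: Callable[[bytes], str],
                ocr_engines: Dict[str, Callable[[bytes], str]]):
    """Build a recognizer that first identifies the script, then
    forwards the image to the OCR engine registered for that script."""
    def recognize(word_image: bytes) -> str:
        script = identify_script(word_image)
        engine = ocr_engines.get(script)
        if engine is None:
            raise KeyError(f"no OCR engine registered for script {script!r}")
        return engine(word_image)
    return recognize

# Usage with stub components standing in for a real SIS and OCR:
router = make_router(lambda img: "Devanagari",
                     {"Devanagari": lambda img: "नमस्ते"})
```

The design point is simply that the OCR engines stay script specific; only the identifier needs to handle all scripts.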

Figure 1.

Block diagram of multi-script document processing system

Over the last decade, researchers have addressed the Indic script identification problem, but very few works are available to date. Among them, one reported work (Pati et al. 2008) considers 11 different scripts. The authors used a database of 11 different languages, where two languages, Kashmiri and Dogri, originating from the northern part of India, were not considered. To represent the scripts, two texture-based frequency-domain features were used: Gabor filters and the discrete cosine transform (DCT). Based on these features, their reported performances are 98% for bi-script and tri-script, and 89% for eleven scripts, using three different classifiers: nearest neighbor, linear discriminant and support vector machine (SVM). This can be considered a benchmark work on printed script identification (PSI) at the word level.
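The exact Gabor/DCT feature design of Pati et al. is not reproduced here. As an assumption-laden sketch only, a simple DCT-based frequency-domain descriptor could keep the magnitudes of the low-frequency coefficients (block size and pooling below are illustrative choices, implemented with SciPy, which the original work does not name):

```python
# Hedged sketch of a DCT-based frequency-domain texture descriptor;
# not the feature of Pati et al. (2008).
import numpy as np
from scipy.fft import dctn

def dct_lowfreq_energy(word_img, k=8):
    """2-D DCT of the word image; keep magnitudes of the top-left
    k x k coefficient block, i.e. the low-frequency texture content."""
    coeffs = dctn(word_img, norm="ortho")
    return np.abs(coeffs[:k, :k]).ravel()
```

A quick sanity check: for a constant image, essentially all energy sits in the DC coefficient, so the first feature value dominates the rest.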

Considering other popular works on Indic and non-Indic scripts, a textual feature based technique was proposed (Hochberg et al. 1997) and tested on identifying six different scripts: Arabic, Armenian, Devanagari, Chinese, Cyrillic, and Burmese. This is one of the earlier works addressing the script identification problem; only one Indic script was considered along with five non-Indic scripts.

An attempt for line level script identification (Pal et al. 2002) was proposed and tested on five different scripts: Bangla, Devanagari, Chinese, Arabic and Roman.

A technique (Jahawar et al. 2003) using headline and contextual information based features was proposed to identify Devanagari and Telugu scripts. In addition, PCA was used to reduce the feature vector size and the classification was done using SVM.

A Gabor energy based technique with a k-nearest neighbor (k-NN) classifier (Joshi et al. 2006) for paragraph level script identification was proposed and tested on ten different Indic scripts. Compared with earlier works, the coverage of this study was broad, spanning most of the official Indic scripts.

A technique (Dhanya et al. 2002) using a Gabor filter based directional feature and an SVM classifier was proposed to separate Tamil and Roman scripts.

A method for script identification by combining trainable classifiers has been proposed (Chaudhury et al. 2000) and tested on six different scripts: Devanagari, Telugu, Roman, Malayalam, Bangla and Urdu.

In the script identification review papers (Ghosh et al. 2010; Singh et al. 2015), the authors pointed out the unavailability of benchmark works considering all official Indic scripts. Following these reviews, we are motivated to publish a benchmark database and results covering all 13 official Indic languages from 11 scripts.
