1. Introduction
Multi-script document processing is important for a country like India, where 23 official languages (including English) are in use and 11 different scripts are used to write them. We handle multi-script documents in our day-to-day life, and automatic processing of such documents is a pressing need in this digital era. OCR systems are, in general, script specific, so processing documents containing more than one script is not straightforward. A common solution is therefore to develop a script identification system (SIS) that acts as a precursor to the script-specific OCR. To address this issue, in this paper we present a database composed of 13 different languages from 11 different scripts (containing a fairly large number of words) for automatic script identification in multi-script documents. Figure 1 shows a general block diagram of a multi-script document processing system.
Figure 1. Block diagram of multi-script document processing system
Over the last decade, researchers have addressed the Indic script identification problem, but very few works are available to date. Among them, one work (Pati et al. 2008) considers 11 different scripts. The authors used a database drawn from 11 different languages; two languages, Kashmiri and Dogri, originating from the northern part of India, were not considered. To represent the scripts, two texture-based features were used: Gabor filters and discrete cosine transform (DCT) based frequency-domain features. Based on these features and three different classifiers (nearest neighbor, linear discriminant, and support vector machine (SVM)), the reported performances are 98% for the bi-script and tri-script cases and 89% for the eleven-script case. This can be considered a benchmark work on printed script identification (PSI) at the word level.
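To make the DCT-based texture representation concrete, the following is a minimal sketch of our own (not the implementation of Pati et al.): it assumes a fixed-size square grayscale word image as a NumPy array, computes the 2D DCT, keeps the low-frequency coefficients as a feature vector, and classifies with a simple nearest-neighbor rule. The function names and parameter choices are illustrative assumptions.

```python
import numpy as np

def dct2(img):
    """Orthonormal 2D DCT-II of a square grayscale image."""
    n = img.shape[0]
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    # Orthonormal DCT-II basis matrix: 2D transform is C @ img @ C.T.
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C @ img @ C.T

def dct_features(word_img, keep=4):
    """Low-frequency DCT coefficients (keep x keep block) as a texture descriptor."""
    coeffs = dct2(word_img.astype(float))
    return coeffs[:keep, :keep].ravel()

def nn_classify(feat, train_feats, train_labels):
    """1-nearest-neighbor classification by Euclidean distance."""
    d = np.linalg.norm(train_feats - feat, axis=1)
    return train_labels[int(np.argmin(d))]
```

In practice, word images of different sizes would first be normalized to a common resolution, and a discriminative classifier such as an SVM would typically replace the nearest-neighbor rule.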
Considering other popular works on Indic and non-Indic scripts, a textual-feature-based technique (Hochberg et al. 1997) was proposed and tested on identifying six different scripts: Arabic, Armenian, Devanagari, Chinese, Cyrillic, and Burmese. This is one of the earliest works addressing the script identification problem; only one Indic script was considered, along with five non-Indic scripts.
A line-level script identification technique (Pal et al. 2002) was proposed and tested on five different scripts: Bangla, Devanagari, Chinese, Arabic, and Roman.
A technique (Jahawar et al. 2003) using headline and contextual-information-based features was proposed to identify Devanagari and Telugu scripts. In addition, PCA was used to reduce the feature vector size, and classification was done using an SVM.
A Gabor-energy-based technique with a k-nearest neighbor (k-NN) classifier (Joshi et al. 2006) was proposed for paragraph-level script identification and tested on ten different Indic scripts. Compared with earlier works, the number of scripts considered here was substantial, covering most of the official Indic scripts.
A technique (Dhanya et al. 2002) using Gabor filter based directional features and an SVM classifier was proposed to separate Tamil and Roman scripts.
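The Gabor-based directional features used in the works above can be sketched as follows. This is our own minimal illustration, not any cited author's implementation: a small bank of real Gabor kernels at evenly spaced orientations is applied to a grayscale image, and the mean response energy per orientation serves as the feature vector. All parameter values (kernel size, wavelength, bandwidth) are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(sigma=4.0, theta=0.0, lam=8.0, gamma=0.5, size=21):
    """Real Gabor kernel at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the carrier oscillates along direction theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / lam)

def gabor_energy_features(img, n_orientations=8):
    """Mean filter-response energy at evenly spaced orientations."""
    feats = []
    F_img = np.fft.fft2(img)
    for k in range(n_orientations):
        kern = gabor_kernel(theta=k * np.pi / n_orientations)
        # FFT-based convolution (circular boundary; adequate for a sketch).
        resp = np.fft.ifft2(F_img * np.fft.fft2(kern, s=img.shape)).real
        feats.append(np.mean(resp**2))
    return np.array(feats)
```

A script region dominated by strokes in one direction yields high energy at the matching orientation, which is what makes such features discriminative between scripts with different stroke statistics.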
A method for script identification by combining trainable classifiers has been proposed (Chaudhury et al. 2000) and tested on six different scripts: Devanagari, Telugu, Roman, Malayalam, Bangla and Urdu.
In the script identification review papers (Ghosh et al. 2010; Singh et al. 2015), the authors pointed out the lack of benchmark works covering all official Indic scripts. Motivated by these reviews, we publish a benchmark database and results covering all 13 official Indic languages.