Arabic Optical Character Recognition: Recent Trends and Future Directions

Husni Al-Muhtaseb (King Fahd University of Petroleum and Minerals, Saudi Arabia) and Rami Qahwaji (University of Bradford, UK)
DOI: 10.4018/978-1-60960-477-6.ch019


Arabic text recognition is receiving increasing attention from both Arabic-speaking and non-Arabic-speaking researchers. This chapter provides a general overview of the state of the art in Arabic Optical Character Recognition (OCR) and the associated text recognition technology. It also investigates the characteristics of the Arabic language with respect to OCR and discusses related research on the different phases of text recognition, including pre-processing and text segmentation, common feature extraction techniques, classification methods, and post-processing techniques. Moreover, the chapter discusses the available databases for Arabic OCR research and lists the available commercial software. Finally, it explores the challenges related to Arabic OCR and discusses possible future trends.
Characteristics of Arabic Text

Arabic is written in a cursive script from right to left. Its alphabet has 28 basic letters. An Arabic letter may have up to four different shapes depending on its position in the word: standalone (isolated form), connected only to the following letter on its left (initial form), connected only to the preceding letter on its right (terminal form), or connected on both sides (medial form). Letters of a word may overlap vertically, even without touching.
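The position-dependent shapes described above can be inspected through the Unicode character database, which encodes them as separate presentation-form code points. The following is a minimal sketch, not part of the chapter: the helper name `contextual_forms` is hypothetical, and it relies on Unicode's naming convention, in which the chapter's "terminal form" is called "FINAL FORM". Ordinary Arabic text normally stores only the base letter and leaves the choice of shape to the rendering engine, so these code points serve here purely as an illustration.

```python
import unicodedata

# Positions as named in the Unicode standard; "FINAL" corresponds to
# what the chapter calls the terminal form.
POSITIONS = ("ISOLATED", "INITIAL", "MEDIAL", "FINAL")


def contextual_forms(letter_name):
    """Return {position: character} for the presentation forms Unicode
    defines for one Arabic letter (hypothetical helper for illustration).

    `letter_name` is the letter's Unicode name fragment, e.g. "NOON".
    Positions that Unicode does not define for the letter are omitted.
    """
    forms = {}
    for position in POSITIONS:
        name = f"ARABIC LETTER {letter_name} {position} FORM"
        try:
            forms[position] = unicodedata.lookup(name)
        except KeyError:
            # No such presentation form exists for this letter.
            pass
    return forms


if __name__ == "__main__":
    for letter in ("NOON", "ALEF"):
        forms = contextual_forms(letter)
        listing = ", ".join(
            f"{pos.lower()}=U+{ord(ch):04X}" for pos, ch in forms.items()
        )
        print(f"{letter}: {len(forms)} shapes ({listing})")
```

Noon yields all four shapes, whereas Alef yields only the isolated and final ones, which is why the text says a letter has "up to" four shapes.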

Arabic letters do not have a fixed size (height and width). Letters in a word can carry diacritics (short vowels) such as Fat-hah, Dhammah, Kasrah, Shaddah, and Sukoon. Moreover, Tanween is formed by a double Fat-hah, double Dhammah, or double Kasrah. Figure 1 lists these diacritics. They are written as strokes placed either above or below the letters. A different diacritic on a letter may change the meaning of a word. Readers of Arabic are used to reading un-vocalized text by deducing the meaning from context.

Figure 1. Arabic short vowels (diacritics)
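To make the relationship between vocalized and un-vocalized text concrete, the sketch below removes the diacritics named above from a string. It is an illustration under stated assumptions rather than a method from the chapter: the function name `strip_diacritics` and the two sample words are chosen here, and the code point table covers only the eight marks mentioned in the text, not every combining mark used in Arabic typography.

```python
# Unicode code points for the diacritics named in the text.
# The transliterated labels follow the chapter's spelling.
DIACRITICS = {
    "\u064B": "Tanween (double Fat-hah)",
    "\u064C": "Tanween (double Dhammah)",
    "\u064D": "Tanween (double Kasrah)",
    "\u064E": "Fat-hah",
    "\u064F": "Dhammah",
    "\u0650": "Kasrah",
    "\u0651": "Shaddah",
    "\u0652": "Sukoon",
}


def strip_diacritics(text):
    """Return `text` with the diacritics listed above removed
    (hypothetical helper for illustration)."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


if __name__ == "__main__":
    # Two illustrative words (not from the chapter) that share the
    # letters Kaf, Teh, Beh and differ only in their diacritics.
    kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
    kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

    print(kataba != kutiba)                                      # differ when vocalized
    print(strip_diacritics(kataba) == strip_diacritics(kutiba))  # same when un-vocalized
```

Both words reduce to the same three-letter skeleton once the marks are removed, which is the ambiguity a reader, or an OCR post-processing stage, has to resolve from context.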

Figure 2 shows some of the characteristics of Arabic text: the baseline, overlapping letters, diacritics, and two shapes of the letter Noon (initial and medial).

Figure 2. An example of an Arabic sentence indicating some characteristics of Arabic text

