Arabic Phonetic Dictionaries for Speech Recognition

Mohamed Ali (King Fahd University of Petroleum and Minerals, Saudi Arabia), Moustafa Elshafei (King Fahd University of Petroleum and Minerals, Saudi Arabia), Mansour Al-Ghamdi (King Abdulaziz City of Science and Technology, Saudi Arabia) and Husni Al-Muhtaseb (King Fahd University of Petroleum and Minerals, Saudi Arabia)
Copyright: © 2009 |Pages: 14
DOI: 10.4018/jitr.2009062905

Abstract

Phonetic dictionaries are essential components of large-vocabulary, speaker-independent speech recognition systems. This paper presents a rule-based technique to generate phonetic dictionaries for a large-vocabulary Arabic speech recognition system. The system uses conventional Arabic pronunciation rules, common pronunciation rules of Modern Standard Arabic, and some common dialectal cases. The paper explains these rules in detail and gives their formal mathematical presentation. The rules were used to generate a dictionary for a 5.4-hour corpus of broadcast news. The rules and the phone set were tested and evaluated on an Arabic speech recognition system, which was trained on 4.3 hours of the corpus and tested on the remaining 1.1 hours. The phonetic dictionary contains 23,841 definitions corresponding to about 14,232 words. The language model contains both bi-grams and tri-grams. The resulting Word Error Rate (WER) was 9.0%.
Article Preview

Introduction

Automatic Speech Recognition (ASR) is a key technology for a variety of industrial and IT applications, extending the reach of IT across people as well as applications. ASR is gaining a growing role in applications such as hands-free operation and control (as in cars and airplanes), automatic query answering, telephone communication with information systems, automatic dictation (speech-to-text transcription), and government information systems. In fact, speech communication with computers, PCs, and household appliances is envisioned to be the dominant human-machine interface in the near future. In spite of the tangible success of this technology for English and other languages, many issues for the Arabic language still need to be addressed by researchers to catch up with the progress of ASR technology in other languages.

One of the key components of the modern large-vocabulary speech recognition systems is the pronunciation or phonetic dictionary. This dictionary serves as an intermediary between the Acoustic Model and the Language Model in speech recognition systems. It contains a subset of the words available in the language and the pronunciation of each word in terms of the phonemes or the allophones available in the acoustic model.
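The dictionary structure described above can be sketched as a simple mapping from words to pronunciation variants. The words and phone symbols below are illustrative placeholders, not entries from the paper's dictionary:

```python
# A minimal sketch of a phonetic dictionary: each word maps to one or
# more pronunciation variants, each a sequence of phone symbols.
# Words and phone symbols here are illustrative, not from the paper.
phonetic_dict = {
    "speech": [["S", "P", "IY", "CH"]],
    "data":   [["D", "EY", "T", "AH"], ["D", "AE", "T", "AH"]],  # two variants
}

def pronunciations(word):
    """Return every pronunciation variant listed for a word."""
    return phonetic_dict.get(word.lower(), [])

print(pronunciations("data"))
```

During decoding, the recognizer consults such entries to expand each candidate word into the phone (or allophone) units modeled acoustically.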

For instance, the CMU dictionary for North American English contains over 125,000 words and their transcriptions (CMU, 2008). The format of this dictionary is particularly useful for speech recognition and synthesis, as it maps words to their pronunciations in the given phoneme set. The current phoneme set contains 39 English phonemes, for which the vowels may also carry lexical stress. Because of the large number of pronunciation exceptions in English, this dictionary was essentially built manually by experts over many years.
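A CMU-style entry is a word followed by its phone sequence, with alternate pronunciations marked by a parenthesized index (e.g. `READ(2)`). A small parser for this format, assuming the whitespace-separated layout described above, might look like:

```python
import re

def parse_cmudict_line(line):
    """Parse one CMU-dict-style entry: 'WORD  PH1 PH2 ...'.
    Alternate pronunciations are written WORD(2), WORD(3), and so on."""
    head, *phones = line.split()
    m = re.match(r"^(.*?)(?:\((\d+)\))?$", head)
    word, variant = m.group(1), int(m.group(2) or 1)
    return word, variant, phones

# 'READ' has more than one pronunciation; this is its second variant.
word, variant, phones = parse_cmudict_line("READ(2)  R IY D")
```

In the real dictionary, vowel phones also carry a trailing stress digit (e.g. `AH0`, `OW1`); the parser above simply keeps whatever phone symbols appear.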

On the other hand, the pronunciation of Arabic text follows specific rules when the text is fully diacritized. Many of these pronunciation rules can be found in Elshafei (1991) and Alghamdi et al. (2004).
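To illustrate the style of rule the paper formalizes, the following toy sketch derives phones from a fully diacritized word by mapping consonant letters and short-vowel diacritics directly, geminating on shadda, and emitting no vowel on sukun. The letter-to-phone table and phone symbols here are simplified assumptions, not the paper's actual rule set or phone inventory:

```python
# Toy rule-based grapheme-to-phoneme conversion for fully diacritized
# Arabic. The mappings below are a simplified illustration.
CONSONANTS = {"\u0643": "k", "\u062A": "t", "\u0628": "b", "\u0633": "s", "\u0645": "m"}
SHORT_VOWELS = {"\u064E": "a", "\u0650": "i", "\u064F": "u"}  # fatha, kasra, damma
SUKUN, SHADDA = "\u0652", "\u0651"

def g2p(word):
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            phones.append(SHORT_VOWELS[ch])
        elif ch == SHADDA and phones:
            phones.append(phones[-1])  # shadda geminates the preceding consonant
        elif ch == SUKUN:
            pass                       # sukun: no vowel follows the consonant
    return phones

print(g2p("\u0643\u064E\u062A\u064E\u0628\u064E"))  # كَتَبَ -> k a t a b a
```

A full system must also handle long vowels, tanween, the definite article, hamza forms, and cross-word assimilation, which is why the paper devotes a formal rule set to them.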

The statistical approach to speech recognition (Huang et al., 2001; Jelinek, 1998; Rabiner & Juang, 1993) has virtually dominated Automatic Speech Recognition (ASR) research over the last few decades, leading to a number of successes (Lee, 1988; Soltau et al., 2007; Stallard et al., 2008; Young, 1997; Zhou et al., 2003). The approach is dominated by a powerful statistical technique called the Hidden Markov Model (HMM) (Rabiner, 1989). HMM-based ASR has made it possible to build many successful applications that depend on large-vocabulary, speaker-independent continuous speech recognition.

The HMM-based technique essentially consists of recognizing speech by estimating the likelihood of each phoneme at contiguous, small frames of the speech signal (Huang et al., 2001; Rabiner & Juang, 1993). Words in the target vocabulary are modeled into a sequence of phonemes, and then a search procedure is used to find, amongst the words in the vocabulary list, the phoneme sequence that best matches the sequence of phonemes of the spoken word.
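The matching step above can be sketched with a simple nearest-sequence search: score each vocabulary word's phone sequence against the decoded phones and pick the closest. Real systems perform Viterbi search over HMM states with acoustic and language-model scores; this toy version, with a made-up two-word vocabulary and edit distance as the score, only illustrates the idea:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical vocabulary: word -> phone sequence.
VOCAB = {
    "kataba": ["k", "a", "t", "a", "b", "a"],
    "kitab":  ["k", "i", "t", "a", "b"],
}

def best_match(observed_phones):
    """Return the vocabulary word whose phones best match the observation."""
    return min(VOCAB, key=lambda w: edit_distance(VOCAB[w], observed_phones))

print(best_match(["k", "i", "t", "a", "b"]))
```

The pronunciation dictionary supplies the per-word phone sequences this search compares against, which is why its coverage and accuracy directly affect the recognizer's word error rate.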

Two notable successes in the academic community in developing high-performance, large-vocabulary, speaker-independent speech recognition systems are the HTK toolkit, developed at Cambridge University (HTK, 2007), and the Sphinx system, developed at Carnegie Mellon University (Huang et al., 1993; Lamere et al., 2003; Noamany et al., 2007; Placeway et al., 1997).
