Efficient Pronunciation Assessment of Taiwanese-Accented English Based on Unsupervised Model Adaptation and Dynamic Sentence Selection

Chung-Hsien Wu (National Cheng Kung University, Taiwan, R.O.C.), Hung-Yu Su (National Cheng Kung University, Taiwan, R.O.C.) and Chao-Hong Liu (National Cheng Kung University, Taiwan, R.O.C.)
DOI: 10.4018/978-1-4666-1830-5.ch002


This chapter presents an efficient approach to personalized pronunciation assessment of Taiwanese-accented English. The main goal of this study is to detect frequently occurring mispronunciation patterns of Taiwanese-accented English instead of scoring English pronunciations directly. The proposed assessment helps quickly discover a student’s personalized mispronunciations, so that English teachers can spend more time on teaching or rectifying students’ pronunciations. In this approach, an unsupervised model adaptation method is applied to the universal acoustic models to recognize the speech of a specific speaker with mispronunciations and a Taiwanese accent. A dynamic sentence selection algorithm, considering the mutual information of the related mispronunciations, is proposed to choose the sentence containing the most undetected mispronunciations in order to quickly extract personalized mispronunciations. The experimental results show that the proposed unsupervised adaptation approach obtains an accuracy improvement of about 2.1% on the recognition of Taiwanese-accented English speech.
Chapter Preview


Pronunciation is difficult for language learners to master in a second language (L2) because of the influence of their first language (L1), and it therefore requires instruction from professional personnel. Computer Assisted Language Learning (CALL) systems (Menzel et al. 2000) (Mak et al. 2003) were introduced to provide automatic learning and evaluation for the needs of L2 learners (Derwing et al. 2000) (Coniam et al. 1999) (Kalikow et al. 1972). Computer Assisted Pronunciation Training (CAPT) is an important topic of CALL, focusing on rectifying pronunciation errors. To evaluate the pronunciation proficiency of non-native speakers, CAPT systems mostly use pronunciation scoring to score the speech input. Mispronunciation detection, on the other hand, provides additional useful information for L2 learners; e.g., local pronunciation mistakes might be more helpful for pronunciation rectification than a global score for a whole sentence.

In the past decade, automatic speech recognition (ASR) systems were employed to score pronunciation in CAPT systems. SPELL (Hiller et al. 1993) used word pairs for pronunciation scoring at phone level to assess and improve L1 pronunciation in modules for teaching consonant production, vowel quality, rhythm and intonation, in three European languages (English, French and Italian). Hamada et al. (Hamada et al. 1993) used dynamic programming and vector quantization to compare a non-native utterance with a native recording at word level. The capacity of such text-dependent approaches was restricted because training materials could not be updated without new utterances from native speakers. ASR using hidden Markov models (HMMs) has also been adopted for scoring a whole sentence instead of smaller pieces. In an HMM, the state is not directly visible, but the output, which depends on the state, is visible (Baum et al. 1967). Each state has a probability distribution over the possible output tokens, so the sequence of tokens generated by an HMM gives some information about the sequence of states. HMMs are known for their application in temporal pattern recognition such as speech, handwriting, and gesture recognition. Neumeyer et al. (Neumeyer et al. 1996) used HMM log-likelihood, segment duration and timing for pronunciation scoring of the whole sentence. Eskenazi (Eskenazi, 1996) used HMM log-likelihood to score the pronunciations of non-native speech compared to the corresponding native speech. For further analysis of pronunciation, the goodness of pronunciation (GOP) measure (Witt and Young, 2000) was proposed to score each phoneme in an utterance based on HMM likelihood, and several studies introduced other features or methods, such as phoneme posterior score, duration, and speech rate, for phone-level pronunciation scoring (Franco et al. 2000) (Mak et al. 2003) (Neri et al. 2006a) (Nakagawa et al. 2003).
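
To make the HMM log-likelihood scoring mentioned above concrete, the following is a minimal sketch of the forward algorithm on a toy discrete HMM. It is not the method of any of the cited systems: the function name `forward_log_likelihood`, the two-state model, and all probabilities are made-up assumptions for illustration, and real ASR systems use continuous densities over acoustic feature vectors rather than three discrete symbols.

```python
import math

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Return log P(obs) under a discrete HMM using the forward algorithm.

    obs     : list of observed symbol indices
    start_p : start_p[i] = P(first state = i)
    trans_p : trans_p[i][j] = P(next state = j | current state = i)
    emit_p  : emit_p[i][k] = P(symbol k | state i)
    """
    n_states = len(start_p)
    # alpha[i] = P(observations so far, current hidden state = i)
    alpha = [start_p[i] * emit_p[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        # Sum over all predecessor states, then emit the next symbol
        alpha = [
            sum(alpha[i] * trans_p[i][j] for i in range(n_states)) * emit_p[j][symbol]
            for j in range(n_states)
        ]
    # Marginalize over the final hidden state
    return math.log(sum(alpha))

# Hypothetical toy model: 2 hidden states, 3 output symbols
start_p = [0.6, 0.4]
trans_p = [[0.7, 0.3], [0.4, 0.6]]
emit_p = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

utterance = [0, 1, 2]  # made-up symbol sequence
score = forward_log_likelihood(utterance, start_p, trans_p, emit_p)
print(f"log-likelihood: {score:.4f}")
```

A higher (less negative) log-likelihood means the model explains the observed sequence better, which is the intuition behind using it as a pronunciation score. For long utterances, practical implementations work in the log domain or rescale `alpha` at each step to avoid numerical underflow.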
