Speechreading using Modified Visual Feature Vectors

Preety Singh (Malaviya National Institute of Technology, India), Vijay Laxmi (Malaviya National Institute of Technology, India) and M. S. Gaur (Malaviya National Institute of Technology, India)
DOI: 10.4018/978-1-4666-2169-5.ch012


Audio-Visual Speech Recognition (AVSR) is an emerging technology that improves machine perception of speech by exploiting the bimodality of human speech. Automated speechreading is inspired by the fact that human beings subconsciously use visual cues to interpret speech. This chapter surveys techniques for audio-visual speech recognition. Through this survey, the authors discuss the steps involved in a robust mechanism for the perception of speech for human-computer interaction. The main emphasis is on visual speech recognition, taking only the visual cues into account. Previous research has shown that visual-only speech recognition systems pose many challenges. The authors present a speech recognition system in which only the visual modality is used for recognition of the spoken word. Significant features are extracted from lip images and used to build n-gram feature vectors. Classification of speech using these modified feature vectors results in improved recognition accuracy of the spoken word.
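The n-gram feature vectors mentioned above can be illustrated with a minimal sketch. The chapter does not specify the exact construction, so the following is an assumption for illustration only: each video frame's lip features are reduced to a discrete symbol, and overlapping n-grams over the symbol sequence are counted to form the feature vector. The function names `quantize` and `ngram_vector` are hypothetical.

```python
# Illustrative sketch (not the authors' implementation): building n-gram
# feature vectors from a sequence of per-frame lip features. Each frame is
# reduced here to a single quantized symbol; a real system would use
# geometric lip features (width, height, area, etc.) per frame.

from collections import Counter

def quantize(value, bins=4):
    """Map a normalized per-frame feature value in [0, 1] to a discrete symbol."""
    return min(int(value * bins), bins - 1)

def ngram_vector(symbols, n=2):
    """Count overlapping n-grams over the symbol sequence."""
    grams = [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]
    return Counter(grams)

# A toy sequence of normalized lip-width values across video frames:
widths = [0.1, 0.3, 0.8, 0.9, 0.4, 0.2]
symbols = [quantize(w) for w in widths]   # [0, 1, 3, 3, 1, 0]
vector = ngram_vector(symbols, n=2)       # bigram counts used for classification
```

The resulting count vector can then be fed to a standard classifier; the intuition is that n-grams capture short-term temporal dynamics of the lip shape that single-frame features miss.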
Chapter Preview


Automatic Speech Recognition (ASR) technology allows a computer to recognize the words a person speaks. ASR systems have been shown to achieve word recognition rates of 98-99% in controlled environments with a single speaker, a microphone in close proximity, and minimal background noise. In real life, such ideal conditions are rarely met, and automatic speech recognition techniques show a marked degradation in performance. Psycholinguistic research has shown that speech perception is improved by visual cues (Dodd & Campbell, 1987). While the auditory signal degrades as the Signal-to-Noise Ratio (SNR) decreases, the visual signal is unaffected by it. By incorporating a visual input along with the audio signal, the noise robustness of speech recognition improves significantly. This is referred to as Audio-Visual Speech Recognition (AVSR). It can improve the intelligibility of speech in a multi-speaker environment or against a noisy background. When accented speech or a foreign language is to be understood, it is easier to identify the linguistic message if it is accompanied by visual cues (Sumby & Pollack, 1954).

The production and perception of human speech are bimodal in nature. It has also been shown that the addition of visual information is equivalent to a gain of about 12 dB in SNR (Chen, 2001). Though visual information might not prove very beneficial for a small vocabulary set, it plays an important role for a large vocabulary or in the presence of noise. Visual cues help in localization of the audio source. They supplement the audio signal by providing speech segmental information and also supply important information about the position of the articulators (Potamianos, Neti, Gravier, Garg, & Senior, 2003).

Though humans have an inherent ability to lip-read, it is difficult to train computers to do the same. Human interpretation of speech is aided by knowledge of the context of the conversation and of facial movements. People with hearing disabilities rely heavily on lip-reading, and the movement of the eyebrows, cheeks, and chin aids their comprehension of speech. While observation of these articulatory gestures comes naturally to humans and helps them abstract visual cues to identify speech, machines have not been able to perform similarly. Lip-reading systems usually take a sequence of lip images as visual input, but lip segmentation itself is a challenging task. A great deal of research has been done in the field of visual speech recognition; however, the accuracy achieved has been only around 40% (Matthews, Cootes, Bangham, Cox, & Harvey, 2002).

Production of sound involves speech articulators, some of which are visible (lips, tongue, and teeth) and some of which are not (velum, vocal cords, nasal tract, etc.). The various speech articulators all affect the speech that is produced. The visible articulators participating in the modulation of the sound wave are known as the primary indicators of visual speech and are the most important. Image sequences of speech show that the tongue and teeth are only partially visible, and their individual extraction is not always feasible. The cheeks, chin, and nose serve as secondary indicators. The linguistic message can be decoded by observing some of the articulatory movements that produce the acoustic signal, which can improve auditory speech perception. The place of articulation can help in distinguishing between ambiguous sounds, for example, /p/ (a bilabial) and /k/ (a velar), /b/ (a bilabial) and /d/ (an alveolar), /m/ (a bilabial) and /n/ (an alveolar). These three pairs are a frequent cause of acoustic confusion for humans unless aided by visual input or contextual information.

The distinctive sounds produced are known as phonemes, and the specific shape of the mouth while producing a sound is called a viseme. For the English language, the ARPABET table (Shoup, 1980), consisting of 48 phonemes, is generally used for classification, though there is no standard viseme table. These phonemes and visemes are defined in the context of human perception, not computer perception, of speech.
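Since there is no standard viseme table, the many-to-one relationship between phonemes and visemes can be illustrated with a small, hypothetical grouping by place of articulation. The class names and the mapping below are assumptions for illustration, not a standard table.

```python
# Hedged illustration: a hypothetical phoneme-to-viseme grouping by place of
# articulation. Phonemes in the same group look alike on the lips even though
# they sound different, which is the source of visual confusability.

VISEME_MAP = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
}

def phonemes_to_visemes(phonemes):
    """Collapse a phoneme sequence into viseme classes; unknown phonemes -> None."""
    return [VISEME_MAP.get(p) for p in phonemes]

# /p/ and /b/ map to the same viseme, so a visual-only system cannot separate
# them, whereas /p/ and /k/ fall into different visemes and can be told apart
# by lip shape alone.
```

This mirrors the confusion pairs discussed above: /p/ vs. /k/ and /m/ vs. /n/ differ in place of articulation and are thus visually separable, while pairs sharing a viseme class remain ambiguous to a visual-only recognizer.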
