Visual Speech Processing and Recognition

Constantine Kotropoulos (Aristotle University of Thessaloniki, Greece) and Ioannis Pitas (Aristotle University of Thessaloniki, Greece)
Copyright: © 2009 |Pages: 33
DOI: 10.4018/978-1-60566-186-5.ch009


This chapter addresses both low- and high-level problems in visual speech processing and recognition. In particular, mouth region segmentation and lip contour extraction are addressed first. Next, visual speech recognition with parallel support vector machines and temporal Viterbi lattices is demonstrated on a small vocabulary task.
Chapter Preview


Audio-visual speech recognition is an emerging research field that requires multi-modal signal processing. The motivation for using visual information in speech recognition lies in the fact that human speech production is bimodal by nature (Campbell, 1998; Massaro, 1987; Reisberg, 1987; Sumby, 1954; Summerfield, 1987). Although human speech is produced by the vibration of the vocal cords, it also depends on articulators that are partly visible, such as the tongue, the teeth, and the lips. Furthermore, the muscles that generate facial expressions are also employed in speech production. Consequently, speech can be partially recognized from the visible articulators involved in its production, in particular from the image region comprising the mouth (Benoît, 1992; Chen, 1998; Chen, 2001).

Undoubtedly, the acoustic signal carries the most useful information for speech recognition. However, when speech is degraded by noise, integrating the visual information with the acoustic information significantly reduces the word error rate (Lombard, 1911; McGurk, 1976). Indeed, under noisy conditions, it has been shown that using both modalities in speech recognition offers a gain equivalent to a 12 dB increase in the signal-to-noise ratio of the acoustic signal (Chen, 2001). For large vocabulary speech recognition tasks, the visual signal can yield a performance gain when it is integrated with the acoustic signal, even for clean acoustic speech (Neti, 2001). It is worth noting that lipreading cannot replace the normal auditory function, because its greatest weakness is the difficulty of interpreting voicing, prosody, and the manner of production of consonants (Ebrahimi, 1991).

Despite the variety of existing methods in visual speech processing and recognition (Stork, 1996; Neti, 2002; Potamianos, 2003; Aleksic, 2006), there is still ongoing research attempting to: 1) find the most suitable features and classification techniques to discriminate effectively between the different mouth shapes, while keeping the mouth shapes that different individuals produce for the same phone in one class; 2) require minimal processing of the mouth image, to allow for a real-time implementation of mouth detection, lip contour extraction, and mouth shape classification; and 3) facilitate the easy integration of audio and video speech recognition modules. This chapter addresses both low- and high-level problems in visual speech processing and recognition, summarizing and extending past results (Gordan, 2001; Gordan, 2002a; Gordan, 2002b) and contributing to points 1) and 2) above. Mouth region segmentation is described first. Next, lip contour extraction is discussed. Finally, an SVM-based approach to visual speech recognition with a dynamic network is studied.
