Lip Feature Extraction and Feature Evaluation in the Context of Speech and Speaker Recognition

Petar S. Aleksic (Google Inc., USA) and Aggelos K. Katsaggelos (Northwestern University, USA)
Copyright: © 2009 |Pages: 31
DOI: 10.4018/978-1-60566-186-5.ch002

Abstract

Significant work has investigated the relationships among articulatory movements, vocal tract shape, and speech acoustics (Fant, 1960; Flanagan, 1965; Narayanan & Alwan, 2000; Schroeter & Sondhi, 1994). Face motion has been shown to correlate strongly with both vocal tract shape and speech acoustics (Grant & Braida, 1991; Massaro & Stork, 1998; Summerfield, 1979, 1987, 1992; Williams & Katsaggelos, 2002; Yehia, Rubin, & Vatikiotis-Bateson, 1998). In particular, dynamic lip information is not only correlated with the acoustic speech information but also complementary to it. Its integration into an automatic speech recognition (ASR) system, resulting in an audio-visual (AV) system, can potentially increase the system’s performance. Although visual speech information is usually used together with acoustic information, there are applications where visual-only (V-only) ASR systems can achieve high recognition rates. These include small-vocabulary ASR (digits, a small number of commands, etc.) and ASR in the presence of adverse acoustic conditions. The choice and accurate extraction of visual features strongly affect the performance of AV and V-only ASR systems. The establishment of lip features for speech recognition is a relatively new research topic. Although a number of approaches can be used to extract and represent visual lip information, little work in the literature compares the relative performance of different features. In this chapter, the authors describe various approaches for extracting and representing important visual features, review existing systems, evaluate their relative performance in terms of speech and speaker recognition rates, and discuss future research and development directions in this area.
Chapter Preview

Introduction

Over the past decades, significant interest and effort have been focused on exploiting the visual modality to improve human-computer interaction (HCI), and also on the automatic recognition of visual speech (video sequences of the face or mouth area), also known as automatic lipreading, and its integration with traditional audio-only systems, giving rise to AV ASR (Aleksic, Potamianos, & Katsaggelos, 2005; Aleksic, Williams, Wu, & Katsaggelos, 2002; Chen, 2001; Chen & Rao, 1998; Chibelushi, Deravi, & Mason, 2002; Dupont & Luettin, 2000; Oviatt, et al., 2000; Petajan, 1985; Potamianos, Neti, Gravier, Garg, & Senior, 2003; Potamianos, Neti, Luettin, & Matthews, 2004; Schroeter, et al., 2000; Stork & Hennecke, 1996). The successes in these areas form another basis for exploiting visual information in the speaker recognition problem (J. P. Campbell, 1997; Jain, Ross, & Prabhakar, 2004; Jain & Uludag, 2003; Ratha, Senior, & Bolle, 2001; Unknown, 2005), thus giving rise to AV speaker recognition (Abdeljaoued, 1999; Aleksic & Katsaggelos, 2003, 2006; Basu, et al., 1999; Ben-Yacoub, Abdeljaoued, & Mayoraz, 1999; Bengio, 2003, 2004; Bigun, Bigun, Duc, & Fisher, 1997; Brunelli & Falavigna, 1995; Brunelli, Falavigna, Poggio, & Stringa, 1995; Chaudhari & Ramaswamy, 2003; Chaudhari, Ramaswamy, Potamianos, & Neti, 2003; Chibelushi, Deravi, & Mason, 1993, 1997; Dieckmann, Plankensteiner, & Wagner, 1997; Erzin, Yemez, & Tekalp, 2005; Fox, Gross, Cohn, & Reilly, 2005; Fox, Gross, de Chazal, Cohn, & Reilly, 2003; Fox & Reilly, 2003; Frischolz & Dieckmann, 2000; T. J. Hazen, Weinstein, Kabir, Park, & Heisele, 2003; Hong & Jain, 1998; Jourlin, Luettin, Genoud, & Wassner, 1997a, 1997b; Kanak, Erzin, Yemez, & Tekalp, 2003; Kittler, Hatef, Duin, & Matas, 1998; Kittler, Matas, Johnsson, & Ramos-Sanchez, 1997; Kittler & Messer, 2002; Luettin, 1997; Radova & Psutka, 1997; Ross & Jain, 2003; Sanderson & Paliwal, 2003, 2004; Sargin, Erzin, Yemez, & Tekalp, 2006; Wark, Sridharan, & Chandran, 1999a, 1999b, 2000; Yemez, Kanak, Erzin, & Tekalp, 2003).

Humans easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information (Lippmann, 1997). Face visibility benefits speech perception because the visual signal is both correlated with the produced audio signal (Aleksic & Katsaggelos, 2004b; Barbosa & Yehia, 2001; Barker & Berthommier, 1999; Jiang, Alwan, Keating, E. T. Auer, & Bernstein, 2002; Yehia, Kuratate, & Vatikiotis-Bateson, 1999; Yehia, et al., 1998) and contains information complementary to it (Grant & Braida, 1991; Massaro & Stork, 1998; Summerfield, 1979, 1987, 1992; Williams & Katsaggelos, 2002; Yehia, et al., 1998). Hearing-impaired individuals use lipreading to improve their speech perception. Normal-hearing persons also use lipreading (Grant & Braida, 1991; Massaro & Stork, 1998; Summerfield, 1979, 1987, 1992; Williams & Katsaggelos, 2002; Yehia, et al., 1998) to a certain extent, especially in acoustically noisy environments.

With respect to the type of information they use, ASR systems can be classified into audio-only, visual-only, and audio-visual. In AV ASR systems, acoustic information is utilized together with visual speech information in order to improve recognition performance (see Figure 1). Visual-only and audio-visual systems exploit the temporal dynamics of visual features, especially those extracted from the mouth region.
Although AV ASR systems are usually used, there are applications where V-only ASR systems can achieve high recognition rates. These include small-vocabulary ASR (digits, a small number of commands, etc.) and ASR in the presence of adverse acoustic conditions.
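
As a rough illustration of how the temporal dynamics of mouth-region features might be combined with acoustic features, the sketch below computes first-order differences of a per-frame lip feature sequence and concatenates them with audio features frame by frame. This preview does not specify the chapter's actual feature set or fusion scheme, so everything here is an assumption: the function names `delta_features` and `early_fusion` are hypothetical, the lip features are taken to be simple width and height values, and the audio and visual streams are assumed to be already synchronized to the same frame rate.

```python
from typing import List, Sequence

Frame = List[float]


def delta_features(frames: Sequence[Sequence[float]]) -> List[Frame]:
    """Approximate the temporal change of each feature.

    Uses a central difference for interior frames and a one-sided
    difference at the first and last frame (an illustrative choice).
    """
    n = len(frames)
    deltas: List[Frame] = []
    for t in range(n):
        lo = max(t - 1, 0)          # index of the earlier neighbour
        hi = min(t + 1, n - 1)      # index of the later neighbour
        span = hi - lo              # 2 inside, 1 at the edges, 0 if n == 1
        if span == 0:
            deltas.append([0.0] * len(frames[t]))
        else:
            deltas.append([(b - a) / span
                           for a, b in zip(frames[lo], frames[hi])])
    return deltas


def early_fusion(audio: Sequence[Sequence[float]],
                 visual: Sequence[Sequence[float]]) -> List[Frame]:
    """Concatenate audio, static visual, and dynamic visual features.

    Assumes both streams have the same number of frames.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must be frame-aligned")
    dynamic = delta_features(visual)
    return [list(a) + list(v) + d
            for a, v, d in zip(audio, visual, dynamic)]


# Hypothetical data: three frames of (lip width, lip height) and one
# acoustic feature per frame.
visual = [[4.0, 1.0], [5.0, 2.0], [7.0, 2.0]]
audio = [[0.1], [0.2], [0.3]]
fused = early_fusion(audio, visual)
```

Each fused frame holds the acoustic features first, then the static lip features, then their temporal differences. A V-only system under the same assumptions would simply omit the audio part and keep the static and dynamic visual features.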
