Audio-Visual and Visual-Only Speech and Speaker Recognition: Issues about Theory, System Design, and Implementation

Derek J. Shiell (Northwestern University, USA), Louis H. Terry (Northwestern University, USA), Petar S. Aleksic (Google Inc., USA) and Aggelos K. Katsaggelos (Northwestern University, USA)
Copyright: © 2009 |Pages: 38
DOI: 10.4018/978-1-60566-186-5.ch001


The information embedded in the visual dynamics of speech has the potential to improve the performance of speech and speaker recognition systems. The information carried in the visual speech signal complements the information in the acoustic speech signal, which is particularly beneficial in adverse acoustic environments. Non-invasive methods using low-cost sensors can be used to obtain acoustic and visual biometric signals, such as a person’s voice and lip movement, with little user cooperation. Such unobtrusive biometric systems are needed to promote the widespread adoption of biometric technology in today’s society. In this chapter, the authors describe the main components and theory of audio-visual and visual-only speech and speaker recognition systems. Audio-visual corpora are described, and a number of speech and speaker recognition systems are reviewed. Finally, various open issues in system design and implementation are discussed, and future research and development directions in this area are presented.
Chapter Preview

Introduction To Audio-Visual Recognition Systems

Modern audio-only speech and speaker recognition systems lack the robustness needed for wide-scale deployment. Among the factors negatively affecting such audio-only systems are variations in microphone sensitivity, acoustic environment, channel noise, and the recognition scenario (i.e., limited vs. unlimited domains). Even at typical acoustic background signal-to-noise ratio (SNR) levels (-10dB to 15dB), their performance can significantly degrade. However, it has been well established in the literature that the incorporation of additional modalities, such as video, can improve system performance. The reader is directed to the suggested readings at the end of this chapter for comprehensive coverage of these multi-modal systems. It is well known that face visibility can improve speech perception because the visual signal is both correlated to the acoustic speech signal and contains complementary information to it (Aleksic & Katsaggelos, 2004; Barbosa & Yehia, 2001; Barker & Berthommier, 1999; Jiang, Alwan, Keating, E. T. Auer, & Bernstein, 2002; Yehia, Kuratate, & Vatikiotis-Bateson, 1999; Yehia, Rubin, & Vatikiotis-Bateson, 1998). Although the potential for improvement in speech recognition is greater in poor acoustic conditions, multiple experiments have shown that modeling visual speech dynamics can improve speech and speaker recognition performance even in noise-free environments (Aleksic & Katsaggelos, 2003a; Chaudhari, Ramaswamy, Potamianos, & Neti, 2003; Fox, Gross, de Chazal, Cohn, & Reilly, 2003).
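For reference, the acoustic SNR figures quoted above are simply the ratio of signal power to noise power expressed in decibels. The helper below is an illustrative sketch (not from the chapter); the function name and the example signals are assumptions made for demonstration:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels, from average sample power."""
    p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# Illustrative example: a unit-amplitude tone (power 0.5) in Gaussian
# noise of power ~0.05, giving an SNR near 10 dB by construction.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(scale=np.sqrt(0.05), size=16000)
print(f"{snr_db(tone, noise):.1f} dB")
```

A -10 dB condition, by the same formula, means the noise carries ten times the power of the speech signal.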

The integration of information from audio and visual modalities is fundamental to the design of AV speech and speaker recognition systems. Fusion strategies must combine information from these modalities in such a way that system performance improves in all settings. Additionally, the performance gains must be large enough to justify the complexity and cost of incorporating the visual modality into a person recognition system. Figure 1 shows the general process of performing AV recognition. While significant advances in AV and V-only speech and speaker recognition have been made over recent years, the fields of speech and speaker recognition still hold many exciting opportunities for future research and development. Many of these open issues in theory, design, and implementation, along with the associated opportunities, are described in the following sections.
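One widely used fusion strategy is decision-level (late) fusion, in which per-class log-likelihoods from the audio and visual streams are combined with a stream weight reflecting each modality's estimated reliability. The function below is a minimal sketch of that idea; the name `late_fusion` and the fixed weight `lam` are illustrative assumptions, and in practice the weight would be tuned to the acoustic conditions (lower audio weight as the SNR drops):

```python
import numpy as np

def late_fusion(log_lik_audio, log_lik_video, lam=0.7):
    """Decision-level fusion of per-class log-likelihoods.

    lam is the audio stream weight in [0, 1]; the video stream
    receives weight (1 - lam). Returns the index of the class with
    the highest fused score.
    """
    la = np.asarray(log_lik_audio, dtype=float)
    lv = np.asarray(log_lik_video, dtype=float)
    fused = lam * la + (1.0 - lam) * lv
    return int(np.argmax(fused))

# Audio weakly favors class 0, video strongly favors class 1:
# with equal weights the stronger video evidence wins.
print(late_fusion([0.0, -1.0], [-4.0, 0.0], lam=0.5))
```

Weighting the streams in the log domain is equivalent to raising each stream's likelihood to an exponent before multiplying, which is the common formulation for stream-weighted multi-modal classifiers.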

Figure 1.

Block diagram of an audio-visual speech and speaker recognition system

Audio-visual and V-only speech and speaker recognition systems currently lack the resources to systematically evaluate performance across a wide range of recognition scenarios and conditions. One of the most important steps towards alleviating this problem is the creation of publicly available multi-modal corpora that better reflect realistic conditions, such as acoustic noise and shadows. A number of existing AV corpora are introduced and suggestions are given for the creation of new corpora to be used as reference points.

It is also important to consider statistical significance when reporting results. Statistics such as the mean and variance should be used to compare the relative performance across recognition systems (Bengio & Mariethoz, 2004). The use of these statistical measures will be helpful in defining criteria for reporting system performance.
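For instance, per-fold recognition accuracies can be summarized by their mean, sample standard deviation, and a rough normal-approximation 95% interval. The helper below is an illustrative sketch, not a procedure taken from the chapter, and the accuracy figures are made up for demonstration:

```python
import statistics as st

def summarize(accuracies):
    """Mean, sample standard deviation, and an approximate 95%
    interval (mean +/- 1.96 * standard error) for per-fold scores."""
    mean = st.mean(accuracies)
    sd = st.stdev(accuracies)
    half = 1.96 * sd / (len(accuracies) ** 0.5)
    return mean, sd, (mean - half, mean + half)

# Hypothetical per-fold accuracies for two systems under comparison.
sys_a = [0.91, 0.89, 0.93, 0.90, 0.92]
sys_b = [0.88, 0.90, 0.87, 0.89, 0.86]
print(summarize(sys_a))
print(summarize(sys_b))
```

If the two systems' intervals overlap heavily, a reported difference in mean accuracy may not be meaningful, which is precisely why reporting only a single best-run number is discouraged.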
