Visual Speech Recognition Across Multiple Views

Patrick Lucey (Queensland University of Technology, Australia), Gerasimos Potamianos (IBM T. J. Watson Research Center, USA) and Sridha Sridharan (Queensland University of Technology, Australia)
Copyright: © 2009 | Pages: 32
DOI: 10.4018/978-1-60566-186-5.ch010


It is well known that visual speech information extracted from video of the speaker’s mouth region can improve the performance of automatic speech recognizers, especially their robustness to acoustic degradation. However, the vast majority of research in this area has focused on frontal videos of the speaker’s face, a clearly restrictive assumption that limits the applicability of audio-visual automatic speech recognition (AVASR) technology in realistic human-computer interaction. In this chapter, the authors advance beyond the single-camera, frontal-view AVASR paradigm, investigating several important aspects of the visual speech recognition problem across multiple camera views of the speaker and expanding on their recent work. The authors base their study on an audio-visual database that contains synchronous frontal and profile views of multiple speakers uttering connected digit strings. They first develop an appearance-based visual front-end that extracts features from frontal and profile videos in a similar fashion. Subsequently, the authors focus on three key areas concerning speech recognition based on the extracted features: (a) comparing frontal and profile visual speech recognition performance, to quantify any degradation across views; (b) fusing the available synchronous camera views, for improved recognition in scenarios where multiple views can be used; and (c) recognizing visual speech using a single pose-invariant statistical model, regardless of camera view. In particular, for the latter, a feature normalization approach between poses is investigated. Experiments on the available database are reported for all of the above areas. This chapter constitutes the first comprehensive study of visual speech recognition across multiple views.
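To make the multi-view fusion idea in (b) concrete, the following is a minimal illustrative sketch of score-level fusion of two view-specific classifiers. The function name, the fixed stream weight, and the example scores are all hypothetical, for illustration only; the chapter's actual fusion method and parameters are described in its experiments section.

```python
import numpy as np

def fuse_view_scores(frontal_loglik, profile_loglik, frontal_weight=0.6):
    """Hypothetical sketch: combine per-class log-likelihoods from
    frontal and profile visual-speech classifiers with a fixed stream
    weight, and return the index of the best-scoring class."""
    w = frontal_weight
    fused = w * np.asarray(frontal_loglik) + (1.0 - w) * np.asarray(profile_loglik)
    return int(np.argmax(fused)), fused

# Illustrative example: three candidate digit classes scored by each view.
frontal = [-4.1, -2.0, -5.3]
profile = [-3.5, -2.4, -2.9]
best, scores = fuse_view_scores(frontal, profile)  # both views favor class 1
```

A fixed weight is the simplest choice; in practice the weight can be tuned on held-out data or adapted to the estimated reliability of each view.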
Chapter Preview


Recent algorithmic advances in the field of automatic speech recognition (ASR), together with progress in technologies such as speech synthesis, natural language understanding, and dialog modeling, have allowed deployment of many automatic systems for human-computer interaction. Of course, these systems require highly accurate ASR to achieve successful task completion and user satisfaction. Although this is generally attainable in relatively quiet environments and for low- to medium-complexity recognition tasks, ASR performance degrades significantly in noisy acoustic environments, especially under conditions mismatched to the training data (Junqua, 2000).

One possible avenue proposed for improving ASR robustness to noise is to incorporate visual speech information, extracted from the speaker’s face, into the speech recognition process – thus giving rise to audio-visual ASR (AVASR) systems. Indeed, over the past two decades significant progress has been achieved in this field, and many researchers have demonstrated dramatic gains in bimodal ASR accuracy, in line with expectations from human speech perception studies (Sumby and Pollack, 1954). Overviews of such efforts can be found in Chibelushi et al. (2002) and Potamianos et al. (2003), among others. Despite this progress, however, practical deployments of AVASR systems have yet to emerge. We believe this is mainly because most research in the field has neglected the robustness of the AVASR visual front-end to realistic video data. One of the most critical overlooked issues is speaker head pose variation – in other words, the camera viewpoint of the speaker’s face.

Indeed, with a few exceptions reviewed in the Background section, nearly all work in the literature has concentrated on the case where the speaker’s face is captured in a fully frontal pose – a rather restrictive human-computer interaction scenario, as Figure 1 makes clear. For example, one potential AVASR application is speech recognition on mobile devices such as cell phones, where device placement with respect to the head does not allow frontal capture. Another interesting scenario is in-vehicle AVASR: due to frequent driver head movement, a frontal pose cannot be guaranteed regardless of camera placement – for example at the rear-view mirror, the cabin driver-side column, or the instrument console. Other possibilities include the design of an audio-visual headset, where a miniature camera is placed next to the microphone on the wearable boom. Requiring frontal views of the speaker’s mouth means that the device must protrude unnecessarily in front of the mouth, creating headset instability and usability issues (Gagne et al., 2001; Huang et al., 2004). In contrast, placing the camera to the side of the face would allow a significantly shorter boom, resulting in a lighter and easier-to-use headset. Finally, an interesting scenario is that of AVASR during meetings and lectures inside smart rooms. There, pan-tilt-zoom (PTZ) cameras can track the meeting speaker(s), providing high-resolution views; however, due to the cameras’ fixed placement in space, frontal speaker views cannot be guaranteed. This latter scenario motivates our work and is discussed in more detail later in this chapter, together with the audio-visual database collected in this domain to drive our research.

Figure 1.

Examples of practical scenarios where frontal AVASR is inadequate: (a) Driver data inside an automobile; (b) Mouth region data from a specially designed audio-visual headset; (c) Data from a lecturer captured by a pan-tilt-zoom camera inside a smart-room.
