Multimodal Speaker Identification Using Discriminative Lip Motion Features

H. Ertan Çetingül; Engin Erzin; Yücel Yemez; A. Murat Tekalp

doi:10.4018/978-1-60566-186-5.ch016

Source Title: Visual Speech Recognition: Lip Segmentation and Mapping

Abstract

This chapter presents a multimodal speaker identification system that integrates audio, lip texture, and lip motion modalities. The authors propose to use the “explicit” lip motion information that best represents the modality for the given problem. The work is presented in two stages. First, they consider several lip motion feature candidates, such as dense motion features on the lip region, motion features on the outer lip contour, and lip shape features, and they introduce their main contribution: a novel two-stage, spatial-temporal discrimination analysis framework designed to obtain the best lip motion features. For speaker identification, the best lip motion features are those that yield the highest discrimination among speakers. Next, they investigate the benefits of including the best lip motion features in multimodal recognition. Audio, lip texture, and lip motion modalities are fused by the reliability weighted summation (RWS) decision rule, and hidden Markov model (HMM)-based modeling is performed for both unimodal and multimodal recognition. Experimental results indicate that discriminative grid-based lip motion features prove to be more valuable and provide additional performance gains in speaker identification.

Introduction

Audio is probably the most natural modality for recognizing speech content and a valuable source for identifying a speaker. However, especially under noisy conditions, audio-only speaker/speech recognition systems are far from perfect. Video also contains important biometric information, such as face/lip appearance, lip shape, and lip movement, that is correlated with audio. Due to this correlation, it is natural to expect that speech content can be partially revealed through lip reading, and that lip movement patterns also contain information about the identity of a speaker. Nevertheless, performance problems are also observed in video-only speaker/speech recognition systems, where poor picture quality, changes in pose and lighting conditions, and varying facial expressions may have detrimental effects (Turk & Pentland, 1991; Zhang, 1997). Hence, robust solutions should employ multiple modalities, i.e., audio and various lip modalities, in a unified scheme.

Indeed, in speaker/speech recognition, state-of-the-art systems employ both audio and lip information in a unified framework (see Chen (2001) and references therein). However, most audio-visual biometric systems combine a simple visual modality with a sophisticated audio modality. Systems employing enhanced visual information are quite limited, for several reasons. On one hand, lip feature extraction and tracking are complex tasks, as a few studies in the literature have shown (see Çetingül (2006b) and references therein). On the other hand, the exploitation of this cue has been limited to three alternative representations for lip information: i) lip texture, ii) lip shape (geometry), and iii) lip motion features. The first represents the lip movement implicitly, along with appearance information that might sometimes carry useful discrimination information; in other cases, however, the appearance may degrade recognition performance, since it is sensitive to acquisition conditions. The second, i.e., lip shape, usually requires tracking the lip contour and fitting contour model parameters and/or computing geometric features such as horizontal/vertical openings, contour perimeter, lip area, etc. This option seems to be the most powerful one for modeling lip movement, especially for lip reading, since it is easier to match mouth openings and closings with the corresponding phonemes. However, lip tracking and contour fitting are challenging tasks, since contour tracking algorithms are sensitive to lighting conditions and image quality. The last option is the use of explicit lip motion features, which are potentially easy to compute and robust to lighting variations.
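The geometric lip shape features named above (horizontal/vertical openings, contour perimeter, lip area) can be illustrated with a minimal sketch. It assumes the outer lip contour has already been tracked and is available as an ordered list of (x, y) points; the function name `lip_shape_features` and the dictionary keys are hypothetical and are not taken from the chapter, which does not specify its feature extraction at this level of detail.

```python
import math


def lip_shape_features(contour):
    """Compute simple geometric features from a closed lip contour.

    `contour` is assumed to be a list of (x, y) points ordered around the
    outer lip boundary. This is an illustrative sketch, not the authors' method.
    """
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]

    # Horizontal and vertical openings taken as the bounding-box extents.
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)

    perimeter = 0.0
    twice_signed_area = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x1 - x0, y1 - y0)
        twice_signed_area += x0 * y1 - x1 * y0  # shoelace formula term

    return {
        "width": width,
        "height": height,
        "perimeter": perimeter,
        "area": abs(twice_signed_area) / 2.0,
    }


# Usage with a hypothetical diamond-shaped contour (units are pixels).
features = lip_shape_features([(-2, 0), (0, 1), (2, 0), (0, -1)])
```

Computing such a vector per video frame would give a feature sequence over time, but the result is only as good as the contour tracker, which is the sensitivity the paragraph above points out.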

Following the generation of different audio-visual modalities, the design of a multimodal recognition system requires addressing three basic issues: i) which modalities to fuse, ii) how to represent each modality with a discriminative and low-dimensional set of features, and iii) how to fuse existing modalities. For the first issue, speech content and voice can be interpreted as two different, though correlated, sources of information in audio signals. Likewise, the video signal can be split into different modalities, such as face/lip texture, lip geometry, and lip motion. The second issue, feature selection, also includes modeling of the classifiers, through which each class is represented with a statistical model or a representative feature set. The curse of dimensionality, computational efficiency, robustness, invariance, and discrimination capability are the most important criteria in the selection of the feature set and the recognition methodology for each modality. For the final issue, modality fusion, there exist different strategies: in early integration, modalities are fused at the data or feature level, whereas in late integration, decisions (or scores) resulting from each expert are combined to give the final conclusion. Multimodal decision fusion can also be viewed from a broader perspective as a way of combining classifiers, where the main motivation is to compensate for possible misclassification errors of a certain classifier with other available classifiers and to end up with a more reliable overall decision. A comprehensive survey and discussion of classifier combination techniques can be found in Kittler (1998).
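Late integration can be illustrated with a short sketch in which each modality's classifier produces one score per enrolled speaker and the scores are combined by a weighted sum. This is only a simplified reading of a reliability-weighted summation rule: the preview does not give the chapter's exact RWS formulation or how reliabilities are estimated, so the function `fuse_late`, the reliability values, and the assumption that scores from different modalities are already on a comparable scale are all hypothetical.

```python
def fuse_late(scores_by_modality, reliabilities):
    """Combine per-modality speaker scores by a normalized weighted sum.

    scores_by_modality: {modality: {speaker: score}}, higher score = better
        match (e.g. a log-likelihood from that modality's model).
    reliabilities: {modality: non-negative reliability value}; assumed to be
        supplied by the caller, since their estimation is not covered here.
    Returns (identified_speaker, fused_scores).
    """
    total = sum(reliabilities[m] for m in scores_by_modality)
    if total <= 0:
        raise ValueError("at least one modality must have positive reliability")

    # Normalize so that the weights sum to one.
    weights = {m: reliabilities[m] / total for m in scores_by_modality}

    speakers = next(iter(scores_by_modality.values())).keys()
    fused = {
        s: sum(weights[m] * scores_by_modality[m][s] for m in scores_by_modality)
        for s in speakers
    }
    # The decision is the speaker with the highest fused score.
    return max(fused, key=fused.get), fused


# Usage with made-up scores for two enrolled speakers.
scores = {
    "audio": {"A": -10.0, "B": -12.0},
    "lip_texture": {"A": -15.0, "B": -11.0},
    "lip_motion": {"A": -9.0, "B": -13.0},
}
speaker, fused = fuse_late(
    scores, {"audio": 2.0, "lip_texture": 1.0, "lip_motion": 1.0}
)
```

Early integration would instead concatenate the feature vectors of the modalities before a single classifier sees them. The late scheme sketched here keeps one expert per modality, so an unreliable modality can be down-weighted without retraining the others.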
