Lip Extraction for Lipreading and Speaker Authentication


Shilin Wang (Shanghai Jiaotong University, China) and Alan Wee-Chung Liew (Griffith University, Australia)
DOI: 10.4018/978-1-60566-026-4.ch388

Abstract

In recent years, there has been growing interest in using visual information for automatic lipreading (Kaynak, Zhi, Cheok, Sengupta, Jian, & Chung, 2004) and visual speaker authentication (Mok, Lau, Leung, Wang, & Yan, 2004). It has been shown that visual cues, such as lip shape and lip movement, can greatly improve the performance of these systems. Various techniques have been proposed over the past decades to extract speech- and speaker-relevant information from lip image sequences. One approach is to extract the lip contour from lip image sequences. This generally involves lip region segmentation and lip contour modeling (Liew, Leung, & Lau, 2002; Wang, Lau, Leung, & Liew, 2004), and the performance of visual speech recognition and visual speaker authentication systems depends largely on the accuracy and efficiency of these two procedures. Lip region segmentation aims to label each pixel in the lip image as lip or non-lip. The accuracy and robustness of the lip segmentation process is of vital importance for subsequent lip extraction. However, large variations caused by different speakers, lighting conditions, or make-up make the task difficult. The low color contrast between the lips and the facial skin, and the presence of facial hair, further complicate the problem. Given a correctly segmented lip region, the lip extraction process then involves fitting a lip model to the lip region. A good lip model should be compact, that is, have a small number of parameters, and should adequately represent most valid lip shapes while rejecting most invalid ones. As most lip extraction techniques involve iterative model fitting, the efficiency of the optimization process is another important issue.

Introduction

In recent years, there has been growing interest in using visual information for automatic lipreading (Kaynak, Zhi, Cheok, Sengupta, Jian, & Chung, 2004) and visual speaker authentication (Mok, Lau, Leung, Wang, & Yan, 2004). It has been shown that visual cues, such as lip shape and lip movement, can greatly improve the performance of these systems. Various techniques have been proposed over the past decades to extract speech- and speaker-relevant information from lip image sequences. One approach is to extract the lip contour from lip image sequences. This generally involves lip region segmentation and lip contour modeling (Liew, Leung, & Lau, 2002; Wang, Lau, Leung, & Liew, 2004), and the performance of visual speech recognition and visual speaker authentication systems depends largely on the accuracy and efficiency of these two procedures.

Lip region segmentation aims to label each pixel in the lip image as lip or non-lip. The accuracy and robustness of the lip segmentation process is of vital importance for subsequent lip extraction. However, large variations caused by different speakers, lighting conditions, or make-up make the task difficult. The low color contrast between the lips and the facial skin, and the presence of facial hair, further complicate the problem. Given a correctly segmented lip region, the lip extraction process then involves fitting a lip model to the lip region. A good lip model should be compact, that is, have a small number of parameters, and should adequately represent most valid lip shapes while rejecting most invalid ones. As most lip extraction techniques involve iterative model fitting, the efficiency of the optimization process is another important issue.


Background

Accurate and robust lip region segmentation is of key importance for subsequent lip extraction. Techniques developed for lip segmentation are generally based on color space analysis, edge detection, Markov random field, or fuzzy clustering.

The color space analysis approach (Eveno, Caplier, & Coulon, 2001) identifies lip pixels solely by their color information. However, color-space-based methods are sensitive to poor color contrast and noise, and give large segmentation errors when the color distributions of the lip and background regions overlap. The edge detection approach (Caplier, 2001) relies on luminance or color edge information to detect the lip boundary. It works well when speakers wear lipstick or reflective markers, but it has difficulty dealing with unadorned lips. Markov random field (MRF) techniques have also been used for lip region segmentation (Lievin & Luthon, 1999). An MRF exploits local neighborhood information to enhance the robustness of the segmentation. However, MRF-based segmentation usually produces erroneous patches outside and inside the mouth region due to the presence of pixels whose color falls in the wrong distribution class.
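To make the color-space idea concrete, the following is a minimal sketch that classifies pixels by a pseudo-hue ratio R/(R+G), exploiting the fact that lip pixels tend to be redder relative to green than skin pixels. The particular transform and the threshold value here are illustrative assumptions, not the exact formulation of the cited work:

```python
import numpy as np

def pseudo_hue_lip_mask(rgb, threshold=0.56):
    """Classify pixels as lip/non-lip from color alone.

    Illustrative color-space sketch: the pseudo-hue ratio R / (R + G)
    is larger for reddish (lip) pixels than for skin.  The threshold
    0.56 is an arbitrary illustration, not a published value.
    """
    rgb = rgb.astype(np.float64)
    r, g = rgb[..., 0], rgb[..., 1]
    pseudo_hue = r / (r + g + 1e-8)  # small epsilon avoids division by zero
    return pseudo_hue > threshold

# Tiny synthetic example: one reddish "lip" pixel, one "skin" pixel.
img = np.array([[[200, 90, 90], [180, 160, 140]]], dtype=np.uint8)
mask = pseudo_hue_lip_mask(img)
```

As the text notes, such a purely color-based rule degrades quickly when the lip and skin color distributions overlap, which is what motivates the spatially-aware methods discussed next.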

Fuzzy clustering is another powerful tool for image segmentation. It assigns each pixel a membership value for every cluster so as to minimize a fuzzy objective function. Since it is an unsupervised learning method, fuzzy clustering can handle lip and skin color variations caused by make-up. Recently, we have proposed several novel fuzzy-clustering-based segmentation techniques that take local (Liew, Leung, & Lau, 2000, 2003) and global spatial information (Leung, Wang, & Lau, 2004; Wang, Lau, Liew, & Leung, 2007) into account to improve segmentation performance. In our approaches, spatial information is seamlessly incorporated into the cost function and the optimization process.
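The flavor of this family of methods can be sketched as follows: standard fuzzy C-means memberships computed from pixel features, followed by a crude neighbourhood averaging of the memberships as a stand-in for local spatial information. In the cited works the spatial terms are built directly into the cost function rather than applied as post-processing, so this is only an illustration:

```python
import numpy as np

def fcm_segment(pixels, c=2, m=2.0, iters=30, seed=0):
    """Plain fuzzy C-means over an (N, d) array of pixel features.

    Returns memberships U of shape (c, N) and centroids V of shape
    (c, d).  Standard FCM updates only; no spatial terms.
    """
    rng = np.random.default_rng(seed)
    n = pixels.shape[0]
    V = pixels[rng.choice(n, size=c, replace=False)].astype(float)
    for _ in range(iters):
        # Squared distance of every pixel to every centroid.
        d2 = ((pixels[None, :, :] - V[:, None, :]) ** 2).sum(-1) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)           # membership update
        Um = U ** m
        V = (Um @ pixels) / Um.sum(axis=1, keepdims=True)  # centroid update
    return U, V

def smooth_memberships(U, h, w):
    """Crude stand-in for local spatial information: replace each
    pixel's memberships by their 3x3 neighbourhood average (edges
    wrap, since np.roll is circular), then renormalize."""
    out = np.empty_like(U)
    for k in range(U.shape[0]):
        M = U[k].reshape(h, w)
        acc = sum(np.roll(np.roll(M, dy, 0), dx, 1)
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1))
        out[k] = (acc / 9.0).ravel()
    return out / out.sum(axis=0, keepdims=True)

# Toy 1x4 "image" of grey levels: two dark pixels, two bright pixels.
pixels = np.array([[0.05], [0.10], [0.90], [0.95]])
U, V = fcm_segment(pixels)
U = smooth_memberships(U, 1, 4)
labels = U.argmax(axis=0)  # hard lip/non-lip segmentation from memberships
```

The unsupervised nature of the clustering is what lets the method adapt to per-speaker lip and skin colors, since no labeled training pixels are required.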

Many techniques have been proposed for lip modeling and extraction, and they differ from each other in the following aspects:

Key Terms in this Chapter

Lip Region Segmentation: The process by which all pixels in the lip image are partitioned into two categories: lip pixels and non-lip pixels.

Fuzzy C-Means Clustering: Fuzzy C-means is the most well-known partition-based fuzzy clustering algorithm. The algorithm starts by choosing c initial centroids, usually at random. It then alternates between updating each data point's membership value with respect to every cluster centroid and updating the centroids based on the new memberships, until convergence.

Cost Function: Also called an objective function; a function associated with an optimization problem that measures how good a solution is.

Point-Driven Optimization (for lip modeling): An iterative procedure for fitting the lip model to the actual lip contour. In each iteration, each contour point is adjusted to a better position so as to improve the cost function value. Compared with other optimization methods, such as parameter-driven optimization, the point-driven approach usually requires fewer iterations and is thus more efficient.
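The per-point adjustment described above can be sketched with a greedy search: each contour point moves to the best of a few candidate positions per iteration. The candidate set, the toy cost, and the stopping rule are illustrative assumptions (here the cost is minimized; some formulations maximize a fitness measure instead):

```python
import numpy as np

def point_driven_fit(points, cost, step=1.0, iters=20):
    """Greedy point-driven contour fitting sketch.

    `points` is an (N, 2) array of contour points and `cost` maps a
    point (x, y) to a scalar to be minimized (e.g. negative edge
    strength).  Each iteration moves every point to the best of its
    current position and four axis-aligned neighbours, stopping early
    once no point moves.
    """
    pts = points.astype(float).copy()
    offsets = np.array([[0, 0], [step, 0], [-step, 0], [0, step], [0, -step]])
    for _ in range(iters):
        moved = False
        for i in range(len(pts)):
            cand = pts[i] + offsets
            costs = [cost(x, y) for x, y in cand]
            best = int(np.argmin(costs))
            if best != 0:  # index 0 is the current position
                pts[i] = cand[best]
                moved = True
        if not moved:
            break
    return pts

# Toy cost: distance to the circle of radius 5 centred at the origin,
# standing in for a real lip-boundary cost derived from the image.
cost = lambda x, y: abs(np.hypot(x, y) - 5.0)
init = np.array([[2.0, 0.0], [0.0, 8.0]])
fitted = point_driven_fit(init, cost)
```

Because each point only evaluates a handful of candidate positions per iteration, the per-iteration work is cheap, which is consistent with the efficiency claim above.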

Visual Speaker Authentication: Authentication of a speaker's identity based on identity-relevant visual information, such as the dynamic visual information of lip movement.

Lip Modeling: An approach that describes the lip region by a number of model parameters. Model definition, cost function formulation, and the optimization process are the key issues in lip modeling.

Normalization Process: Normalization refines the initial input data so as to remove various undesirable effects (such as translation and rotation in our application).

Local and Global Spatial Information (for lip segmentation): In lip region segmentation, spatial information is incorporated to improve the segmentation performance over using color information alone. Local spatial information refers to the spatial continuity among neighbouring pixels, while global spatial information refers to the spatial position of a pixel with respect to the lip region (i.e., inside the lip region, outside it, or near the lip-background boundary).

Lip Contour Extraction: A process to derive the lip contour information from the lip image.
