Lip Contour Extraction from Video Sequences under Natural Lighting Conditions

Marc Lievin (Avid Technology Inc., Canada), Patrice Delmas (The University of Auckland, New Zealand), Jason James (The University of Auckland, New Zealand) and Georgy Gimel’farb (The University of Auckland, New Zealand)
Copyright: © 2009 | Pages: 41
DOI: 10.4018/978-1-60566-186-5.ch006


An algorithm for lip contour extraction is presented in this chapter. A colour video sequence of a speaker's face is acquired under natural lighting conditions, without any particular set-up, make-up, or markers. The first step is a logarithmic colour transform from RGB to HI (hue, intensity) colour space. Next, a segmentation algorithm extracts the lip area by combining motion with red-hue information over a spatio-temporal neighbourhood. The lip region of interest, semantic information, and relevant boundary points are then extracted automatically. A good estimate of the mouth corners places the active contour initialisation close to the boundaries to be extracted. Finally, a set of adapted active contours is applied: an open contour with curvature discontinuities at the mouth corners for the outer lip contour, a line-type open active contour when the mouth is closed, and closed active contours with lip-shape-constrained balloon pressure forces when the mouth is open. They are initialised with the results of the pre-processing stage. An accurate lip shape with inner and outer borders is then obtained, with reliably good results for various speakers under different acquisition conditions.
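The first two stages of the pipeline above (logarithmic colour transform, then segmentation combining red hue with motion) can be sketched roughly as follows. This is an illustrative Python/NumPy sketch only: the function names, thresholds, and the particular log-opponent red hue are assumptions for demonstration, not the chapter's exact HI transform.

```python
import numpy as np

def log_hue_intensity(rgb):
    """Map an RGB image to a logarithmic (hue, intensity) pair.

    Illustrative only: a common log-opponent red hue is used here,
    not necessarily the chapter's exact HI colour transform.
    """
    r, g, b = [rgb[..., i].astype(np.float64) + 1.0 for i in range(3)]
    intensity = (np.log(r) + np.log(g) + np.log(b)) / 3.0
    red_hue = np.log(r) - np.log(g)  # large where red dominates green
    return red_hue, intensity

def lip_mask(frame_t, frame_prev, hue_thresh=0.3, motion_thresh=0.05):
    """Combine red-hue and inter-frame motion evidence into a lip mask.

    Per-pixel for simplicity; the chapter fuses this evidence over a
    spatio-temporal neighbourhood rather than pixel by pixel.
    """
    hue_t, int_t = log_hue_intensity(frame_t)
    _, int_prev = log_hue_intensity(frame_prev)
    moving = np.abs(int_t - int_prev) > motion_thresh
    reddish = hue_t > hue_thresh
    return reddish & moving
```

A binary mask produced this way would then feed the region-of-interest extraction and active-contour initialisation described in the rest of the chapter.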
Chapter Preview


The mouth is a highly deformable 3D object, variable in morphology, topology, colour, and texture. It is composed of more than two hundred distinct muscles displaying different behaviour patterns depending on the language spoken. Fast and accurate tracking of lip movements has been a goal of the computer vision community for at least 30 years. Initially, most research focussed on multimodal speech analysis, where visual and audio information are processed together to improve speech and/or speaker recognition. More recently, there has been rapid growth in the number and diversity of multimedia applications requiring accurate lip movement parameters for modelling and animation.

One of the first lip-tracking systems was developed by Petajan in the mid 1980s. A camera fixed with respect to the head looks upward, facilitating the detection of the nostrils (the darkest blobs on the face) via binary thresholding of the image. Anthropometric heuristics (regarding distances between the eyes, nose, and mouth) then help delineate the mouth area. A second thresholding step segments the aperture between the lips (the darkest blob in the vicinity of the mouth region), yielding parameters (lip aperture, protrusion, stretching, and jaw aperture) that correlate with speech generation. Opening parameters of the mouth (surface, perimeter, height, and aperture area) were combined with the output of a speech recognition module to recognise isolated letters (Petajan, 1985). A modified version used the binary masks to build a mouth-shape database for viseme recognition (Petajan, Bischoff, Bodoff, & Brooke, 1988).

The first attempts to integrate video into speech analysis systems were made to resolve cases of greatly degraded speech, such as situations where “cocktail party” effects dominate or the signal-to-noise ratio is too low. Often the goal was to increase the recognition rate of speech-processing systems by detecting the utterance of isolated units, such as the so-called VCV (vowel-consonant-vowel) sequences, using detection of mouth closure.

The 1990s saw the advent of face tracking and recognition techniques based on colour, contour points, geometrical models (Yuille, Hallinan, & Cohen, 1992), and classification techniques. For most of these applications, the mouth region was the prime area of study, as it carries most of the information conveyed by a talking face. Among others, algorithms can be classified as “template matching” by temporal warping (Pentland & Mase, 1989), neural networks (Bregler & Konig, 1994), or Hidden Markov Models (Goldschen, Garcia, & Petajan, 1994; Silsbee, 1994; Guiard-Marigny, Tsingos, Adjoudani, Benoît, & Gascuel, 1996). Systems then evolved towards a more global approach to the face, integrating colour (Petajan & Graf, 1996; Vogt, in Stork & Hennecke, 1996) and, more specifically, hue for face and lip detection and tracking (Hennecke, Prasad, & Stork, 1996; Lievin & Luthon, 1998; Coianiz, in Stork & Hennecke, 1996).
