Recognizing Prosody from the Lips: Is It Possible to Extract Prosodic Focus from Lip Features?

Marion Dohen (GIPSA-lab, France), Hélène Loevenbruck (GIPSA-lab, France) and Harold Hill (ATR Cognitive Information Science Labs, Japan & University of Wollongong, Australia)
Copyright: © 2009 | Pages: 23
DOI: 10.4018/978-1-60566-186-5.ch014


The aim of this chapter is to examine the possibility of extracting prosodic information from lip features. The authors used two lip feature measurement techniques in order to evaluate the “lip pattern” of prosodic focus in French. Two corpora with Subject-Verb-Object (SVO) sentences were designed. Four focus conditions (S, V, O or neutral) were elicited in a natural dialogue situation. In the first set of experiments, they recorded two speakers of French with front and profile video cameras. The speakers wore blue lipstick and facial markers. In the second set, the authors recorded five speakers with a 3D optical tracker. An analysis of the lip features showed that visible articulatory lip correlates of focus exist for all speakers. Two types of patterns were observed: absolute and differential. A potential outcome of this study is to provide criteria for automatic visual detection of prosodic focus from lip data.
Chapter Preview


For a spoken message to be understood (be it by machine or human being), the segmental information (phones, phonemes, syllables, words) needs to be extracted. Supra-segmental information, however, is also crucial. For instance, two utterances with exactly the same segmental content can have very different meanings if the supra-segmental information (conveyed by prosody) differs, as Lynne Truss (2003) nicely demonstrates:

A woman, without her man, is nothing.

A woman: without her, man is nothing.

Prosodic information has indeed been shown to play a critical role in spoken communication. Prosodic cues are crucial in identifying speech acts and turn-taking, in segmenting the speech flow into structured units, in detecting “important” words and phrases, in spotting and processing disfluencies, in identifying speakers and languages, and in detecting speaker emotions and attitudes. The fact that listeners use prosodic cues in the processing of speech has led some researchers to try to draw information from prosodic features to enhance automatic speech recognition (see, e.g., Pagel, 1999; Waibel, 1988; Yousfi & Meziane, 2006).

Prosodic information involves acoustic parameters, such as intensity, fundamental frequency (F0) and duration. But prosodic information is not just acoustic; it is also articulatory, and in particular it involves visible lip movements. Although prosodic focus typically involves acoustic parameters, several studies have suggested that articulatory modifications, and more specifically visible lip and jaw motion, are also involved (e.g., Cho, 2005; Dohen et al., 2004, 2006; Erickson, 2002; Erickson et al., 2000; Harrington et al., 1995; Vatikiotis-Bateson & Kelso, 1993; Kelso et al., 1985; Lœvenbruck, 1999, 2000; Summers, 1987; De Jong, 1995). In particular, correlates of prosodic focus have been reported on the lips, as will be outlined below. If visual cues are associated with prosodic focus, then prosodic focus should be detectable visually.

Despite these facts, the addition of dynamic lip information to improve automatic speech recognition robustness has been limited to the segmental aspects of speech: lip information is generally used to help phoneme (or word) categorization. Yet visual information about the lips does not only carry segmental information but also prosodic information. The question addressed in this chapter is whether there are potentially extractable visual lip cues to prosodic information. If a visual speech recognition system is able to detect prosodic focus, it will better identify the information put forward by the speaker, a function which can be crucial in a number of applications.
