How to Use Manual Labelers in the Evaluation of Lip Analysis Systems?

Shafiq ur Réhman, Li Liu, Haibo Li
Copyright © 2009 | Pages: 21
DOI: 10.4018/978-1-60566-186-5.ch008


The purpose of this chapter is not to describe any particular lip analysis algorithm, but rather to discuss some of the issues involved in evaluating and calibrating lip features labeled by human operators. In the chapter we question a common practice in the field: using manual lip labels directly as the ground truth for the evaluation of lip analysis algorithms. Our empirical results, obtained with an Expectation-Maximization procedure, show that subjective noise in manual labeling can be significant when quantifying both human and algorithmic extraction performance. To train and evaluate a lip analysis system, one can simultaneously measure the performance of the human operators and infer the “ground truth” from their manual labels.
Chapter Preview


Lip image analysis (lip detection, localization, and segmentation) plays an important role in many real-world tasks, particularly in visual speech analysis/synthesis applications (e.g., the application areas described by Mathew et al., 2002; Daubias et al., 2002; Potamianos et al., 2004; Cetingul et al., 2005; Wark et al., 1998; Chan et al., 1999; Chetty et al., 2004; Wang et al., 2004; Tian et al., 2000; Luthon et al., 2006; Réhman et al., 2007). Although impressive achievements have been made in the field (Wang et al., 2007; Caplier et al., 2008), with the maximum mean lip tracking error reportedly reduced to 4% of the mouth width (Eveno et al., 2004), automatic lip analysis today still presents, from an engineering viewpoint, a significant challenge to current capabilities in computer vision and pattern recognition. An important research problem is how to boost the development of lip analysis technology to achieve an order-of-magnitude improvement. Such an improvement would bring lip analysis to typical mean human performance: lip tracking with one-pixel accuracy for CIF lip images (a position error of around 0.5% of the mouth width).
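The “percent of mouth width” figures above refer to a resolution-independent error measure: key-point error is normalized by the mouth width so that scores are comparable across image sizes. A minimal sketch of such a metric follows; the function name and exact convention are illustrative, not taken from the cited papers:

```python
import numpy as np

def relative_lip_error(pred, gold, mouth_width):
    """Mean Euclidean key-point error as a percentage of mouth width.

    pred, gold: (n_points, 2) arrays of predicted and reference (x, y)
    lip key-point coordinates in pixels; mouth_width in pixels.
    Normalizing by mouth width makes scores comparable across resolutions.
    """
    err = np.linalg.norm(np.asarray(pred) - np.asarray(gold), axis=1).mean()
    return 100.0 * err / mouth_width

# Example: four key points, each off by a (3, 4)-pixel offset (error 5 px),
# with a 100-pixel-wide mouth -> 5% of mouth width.
gold = np.zeros((4, 2))
pred = np.tile([3.0, 4.0], (4, 1))
score = relative_lip_error(pred, gold, mouth_width=100.0)
```

Under this convention, the reported 4% figure for a CIF mouth region of roughly 100 pixels corresponds to a mean error of about 4 pixels, and the one-pixel human benchmark to about 0.5–1%.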

Researchers in lip image analysis (especially lip tracking and localization) should consider lessons from the Face Recognition Vendor Test (FRVT) (Phillips et al., 2000), a series of U.S. Government-sponsored face recognition technology evaluations. Over thirteen years of sustained effort, face recognition performance has improved by two orders of magnitude. To achieve a similar order-of-magnitude improvement in lip analysis technologies, an urgent task is to establish performance benchmarks for lip analysis, assess advances in the state of the art, and identify the most promising approaches for further development.

Encouragingly, the establishment of common test databases is already in progress. Publicly available examples include TULIPS 1.0 (Movellan, 1995), BioID (Jesorsky et al., 2001), (X)M2VTS (Messer et al., 1999), BANCA (Bailly-Bailliere et al., 2003), and JAFFE (Lyons et al., 1999). However, evaluation criteria have not yet been agreed upon on common ground. The lack of well-accepted evaluation protocols makes it impossible even for experts in the field to form a clear picture of the state of the art in lip analysis technology.

A common practice in the (quantitative) evaluation of a lip analysis technology is to collect reference examples through manual labeling: human operators examine a lip image on a computer screen and use a mouse to indicate where they think the lips or the key points are. These manual labels of the lip area are then used as the “ground truth” for training and evaluating lip analysis systems.

A critical question is: can manual labels serve as the ground truth?
